THE EVALUATION OF THE MATCHING METHOD


P. E. VERNON* 














The Maudsley Hospital, London, England 


I. INTRODUCTION 


The method of correct matchings is of considerable importance 
in the investigation of personality, in that it offers a means for com- 
paring complex wholes or Gestalten, in contradistinction to correla- 
tional methods which are generally applied only to the comparison 
of quantitative continua—either unidimensional variables or com- 
posite constructs of variables. For example, in determining whether 
personality can be judged from facial expression, the conventional 
correlational method is as follows. Photographs of a group of Sub- 
jects are presented to a number of Judges (who are unacquainted 
with the Subjects); the Judges are required to rate the photographees 
on a series of separate traits such as intelligence, sociability, humor, 
etc. The Subjects are also rated on the same traits by close acquaint- 
ances, and the two sets of ratings on each trait are then inter-correlated. 
The resulting coefficients are usually very small, and the conclusion 
is drawn that scarcely anything can be deduced about personality from 
facial expression. In a matching experiment, however, short case 
studies or sketches are prepared, describing the personalities of, say, 
five Subjects. These sketches are then presented to a number of 
Judges together with photographs of the Subjects, the latter being 
numbered in a different order from the former. The Judges are 
required to identify or match the two sets of data, i.e. to say which
sketch belongs to which photograph. It is usually found that they 
can do so with a considerable degree of success, for here they have 





* This article was written during the author’s tenure of the Pinsent-Darwin 
Studentship in Mental Pathology. The author is extremely grateful to Mr. J. O.
Irwin, Dr. D. W. Chapman and Dr. W. F. Floyd for their criticisms and advice,
but would add that none of these statisticians must be held responsible for the
arguments put forward.

the opportunity of dealing with individual personalities as unique
wholes, a much more natural proceeding than judging an isolated
trait in a group of personalities.

The method possesses a variety of other applications which are 
described elsewhere.* Probably it would have already obtained wide 
usage but for the lack of a convenient method for evaluating the 
accuracy of its results. It is obvious that the average Judge can always
get one of his matchings correct by chance, so that any number greater 
than one establishes some relationship between the data that are 
matched. Those Continental psychologists (e.g. Binet, Arnheim, 
Bobertag, Wolff) who are chiefly responsible for developing the method 
have generally been content merely to state the percentage by which 
their numbers of correct matchings exceed this chance probability. 
Contemporary investigators, however, demand to know the statistical 
significance of such an excess percentage. Let us therefore examine 
the various statistical treatments that have been applied. 


II. THE STATISTICAL PROBABILITY OF THE RESULT OF A MATCHING 
EXPERIMENT 


We will adopt throughout the following symbols from Chapman’s 
recent study of the method. 


t = the number of elements in a series which is to be matched. When t
photographs are matched with t personality sketches, the experiment may be
denoted as t:t.

s = the number correctly matched by any one Judge; $\bar{s}$ = the average number
correct for all the Judges. Since it is often more convenient to deal with
the total proportion of correct matchings instead of with their number, let
S be this proportion, so that 100 × S is the percentage of correct
matchings, and St = $\bar{s}$.

n = the number of Judges; N = the total number of judgments. In ordinary 
matching experiments N = nt; but there are several other types of match- 
ing, described below, where this does not hold. 


The problem was first discussed by Montmort in 1708.* From 
his analysis it follows that the probability of obtaining exactly s correct 


matchings is

$$P_s = \frac{1}{s!}\left[\frac{1}{2!} - \frac{1}{3!} + \frac{1}{4!} - \cdots + \frac{(-1)^{t-s}}{(t-s)!}\right].$$

This formula soon approximates to $P_s = e^{-1}/s!$. The probability of
obtaining s or more correct matchings is given by
$P_s + P_{s+1} + \cdots + P_t$. Chapman’ gives Tables of the values of both
these probabilities for s = 0 to 7, up to four places of decimals.
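The two chance probabilities just described can be checked numerically. The following is only a sketch: the function names are illustrative and are not taken from Chapman or Todhunter, and the exact formula is written as the full alternating sum, whose first two terms cancel so that the bracket effectively begins at 1/2!.

```python
from math import exp, factorial


def p_exactly(s: int, t: int) -> float:
    """Chance probability of exactly s correct matchings in a t:t experiment."""
    # (1/s!) * sum_{k=0}^{t-s} (-1)^k / k!  (the k = 0 and k = 1 terms cancel).
    return sum((-1) ** k / factorial(k) for k in range(t - s + 1)) / factorial(s)


def p_at_least(s: int, t: int) -> float:
    """Chance probability of s or more correct matchings: P_s + P_(s+1) + ... + P_t."""
    return sum(p_exactly(j, t) for j in range(s, t + 1))


def p_limit(s: int) -> float:
    """The approximation e^(-1)/s! that the exact value approaches as t grows."""
    return exp(-1) / factorial(s)


if __name__ == "__main__":
    t = 10  # illustrative choice of t
    for s in range(8):
        print(s, round(p_exactly(s, t), 4), round(p_limit(s), 4),
              round(p_at_least(s, t), 4))
```

Even for t as small as ten the exact and approximate values of P_s agree to several decimal places, which is the sense in which the formula "soon approximates" to the limiting form.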





*Cf. Todhunter’ (pp. 91-93). Chapman also provides a derivation of this 
equation. 




The probability that n judges will obtain s correct is, of course, 
much smaller than the probability of one judge obtaining s correct; 
hence neither of the above formulae is applicable to matchings by 
groups of judges.* But Chapman has worked out the true proba- 
bilities under such circumstances, and his Table VII gives P to three 
decimal places for all values of $\bar{s}$, when t is equal to or greater than four,
and when n ranges from two to thirty. Though this Table gives the 
only correct solution of our problem, it is at present too restricted in 
range to cover most of the published matching results, or the results 
likely to be published in future; their probabilities are generally vastly 
smaller than .001. Many of them employ more than thirty judges, 
and a t of two or three is quite commonly adopted. The Table could,
of course, be extended by means of Chapman’s general statistical 
solution. But his method, which is very complex, involves the use 
of Salvosa’s Tables for Type III curves, which themselves only extend 
to six decimal places. 

Thus, from the point of view of practical psychological investi- 
gation, the evaluation of a matching result by Chapman’s method is 
not satisfactory, since the method does little more than demonstrate 
whether or not such a result is significantly superior to chance proba- 
bility. The psychologist wishes to know the degree of validity; or 
he wishes to be able to compare the relative success of a number 
of different matching experiments, each of whose probabilities are 
enormously superior to chance. He would therefore prefer, if possible, 
a method which yields a difference, or better still a coefficient, with its 
standard error. 


III. METHODS BASED ON THE NORMAL DISTRIBUTION CURVE 


The objection to methods which involve the SE of a difference 
is that they assume a normal distribution, whereas Chapman has 
shown that the chance distribution of values of $\bar{s}$ is skewed and leptokurtic,
approximating to a Type III curve. Chapman’s article gives
an example of this curve for t = 10 and n = 5; its shape is confirmed
by the results from a statistical experiment with cards. Here the
median value of $\bar{s}$ is 1.0; but in most matching experiments $\bar{s}$ is
distinctly greater than 1.0 (though much less than t). Hence the degree
of skewness is generally not very large, and a normal curve does fit 
fairly closely. 





*In their work on group matching of handwriting with personality, Allport 
and Vernon² incorrectly used the first of these formulae.











4 The Journal of Educational Psychology 


In order to obtain statistical data whose average $\bar{s}$ was greater
than 1.0, the following experiment was carried out. Six balls, ranging
in size from about 0.3 inch to 0.6 inch, were simultaneously rolled down
a plane surface some twenty-two inches long by three inches wide,
which sloped at twelve degrees to the horizontal. The plane was
irregularly studded with protruding nails so that the larger balls were
more frequently checked in their run, and there was a distinct tendency
for the smaller balls to reach the bottom first. The order of arrival
at the bottom was recorded, and a successful match was counted whenever
a ball arrived in its own rank position, i.e. when the smallest ball
was first, the largest last, and so on. In three thousand such runs the
average value of s was 1.630 out of six. When the runs were grouped
into six hundred sets of five, the distribution was nearly symmetrical
and the empirical value of $\sigma_{\bar{s}}$ was 0.5366. For example, ninety-nine
out of the six hundred sets, or 16.5 per cent, gave $\bar{s}$ = 1.0 or less,
and sixty-eight sets or 11.33 per cent gave $\bar{s}$ = 2.0 or more; whereas
in a truly normal distribution with $\sigma$ = 0.5366, the proportions to be
expected in the tails according to the Kelley-Wood tables would be
16.15 per cent and 10.56 per cent. Testing for goodness of fit we find
$\chi^2$ = 15.5 when Pearson’s n’ = 14; P = 0.275. In other words, the
discrepancies between a normal curve and our obtained distribution
might occur as often as two hundred seventy-five times in one thousand
by chance.

It would therefore seem justifiable in actual practice to determine
$\sigma_{\bar{s}}$ for the result of a matching experiment by calculating $\sigma_s$ from the
obtained values of s; then $\sigma_{\bar{s}} = \sigma_s/\sqrt{n}$. In the above experiment
the three thousand values of s gave a SE of 1.222, hence
$\sigma_{\bar{s}} = 1.222/\sqrt{5} = .5465$, which agrees quite closely with the empirical
finding. Even when $\bar{s}$ = 1.0, as in Chapman’s data, so that $\sigma_s$ = 1.0,
the approximation is good; for $\sigma_s/\sqrt{n} = 1/\sqrt{5} = .4472$, and the
empirical value of $\sigma_{\bar{s}}$ as calculated from the distribution of Chapman’s
experimental results seems to be .4418.*
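The chance value $\sigma_s$ = 1.0 and the division by $\sqrt{n}$ can be illustrated by direct enumeration. This is a minimal sketch, assuming pure chance matching (every one of the t! orderings equally likely); it says nothing about the ball experiment itself, whose value 1.222 is simply taken from the text. The function names are illustrative.

```python
from itertools import permutations
from math import sqrt


def chance_mean_and_sigma(t: int) -> tuple[float, float]:
    """Mean and SD of s under pure chance, by enumerating all t! orderings."""
    scores = [sum(pos == item for pos, item in enumerate(order))
              for order in permutations(range(t))]
    mean = sum(scores) / len(scores)
    variance = sum((x - mean) ** 2 for x in scores) / len(scores)
    return mean, sqrt(variance)


def sigma_of_mean(sigma_s: float, n: int) -> float:
    """SE of the average of n independent values of s."""
    return sigma_s / sqrt(n)


if __name__ == "__main__":
    print(chance_mean_and_sigma(6))   # chance mean and SD of s for t = 6
    print(sigma_of_mean(1.0, 5))      # chance case, n = 5
    print(sigma_of_mean(1.222, 5))    # SE quoted for the ball experiment
```

Under chance both the mean and the SD of s come out at 1.0 for t = 6, which is why $1/\sqrt{5}$ serves as the theoretical figure for Chapman’s card data.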

* Chapman does not tabulate his actual results, hence they have been read off,
as closely as possible, from his graph.

Though this method is simple and convenient, and applicable to
the whole range of matching experiments, it seems to be somewhat
inaccurate according to Chapman’s results. From his Table VII
it appears that the probability that $\bar{s}$ will be 1.63 or greater (when
n = 5) is about .057. But in our experiment the difference between
1.63 and 1.0 corresponds to a probability of .165. However, the agreement
is better when n is larger; if our data are grouped into three
hundred sets of ten runs, the probability is .058, and according to
Chapman’s Table the figure should be about .041.

In certain published matching investigations, another method
depending on the SE of a difference has been applied. Allport and
Cantril¹ and Gahagan® determined the difference between the obtained
percentage of correct matchings and the expected or chance percentage,
i.e. $100(S - 1/t)$, and then divided this figure by the PE of a percentage,
$.6745\sqrt{pq/n}$, in order to find its reliability. It seems that Allport
and Cantril took p as the obtained percentage, whereas Gahagan took
it as the expected percentage, so that their respective formulae for the
PE were:

$$67.45\sqrt{\frac{S(1-S)}{N}} \quad\text{and}\quad 67.45\sqrt{\frac{t-1}{t^2 N}}.$$

If instead we assume that
$PE_{diff.} = \sqrt{PE^2_{obt.\ per\ cent} + PE^2_{exp.\ per\ cent}}$,
then we shall have three formulae, yielding three different reliabilities.
Applied to the data of our experiment where the difference between
the obtained and expected values of S per cent was 27.17 − 16.67 =
10.5, the three values of the SE (when n = 10 and N = 60) are 5.743,
4.811 and 7.492, as compared with the empirically determined value
of 6.093. The probabilities of the difference are then .034, .015 and
.080, Chapman’s accurate figure being .041. Though Allport and
Cantril’s method (in this and in other concrete experiments) does
give results which are fairly close to Chapman’s, none of these methods
is really justifiable because the judgments in matching are not all
independent.
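The three standard errors and their probabilities may be reproduced as follows. This is a sketch under two assumptions: the quoted figures are SE’s (i.e. the factor .6745 is omitted), and the probabilities are one-tailed normal areas. The function names are illustrative.

```python
from math import erfc, sqrt


def upper_tail(z: float) -> float:
    """One-tailed normal probability of a deviate of z or more."""
    return 0.5 * erfc(z / sqrt(2))


def three_standard_errors(S: float, t: int, N: int) -> dict[str, float]:
    """SE's of the percentage, in per cent, by the three methods discussed."""
    se_obtained = 100 * sqrt(S * (1 - S) / N)        # p taken as obtained proportion
    se_expected = 100 * sqrt((t - 1) / (t * t * N))  # p taken as chance proportion 1/t
    se_combined = sqrt(se_obtained ** 2 + se_expected ** 2)
    return {"obtained": se_obtained, "expected": se_expected,
            "combined": se_combined}


if __name__ == "__main__":
    S, t, N = 0.2717, 6, 60          # ball experiment, n = 10
    difference = 100 * (S - 1 / t)   # about 10.5 per cent
    for name, se in three_standard_errors(S, t, N).items():
        print(name, round(se, 3), round(upper_tail(difference / se), 3))
```

The spread of the three probabilities for one and the same difference shows how much the verdict depends on which proportion is taken as p.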











IV. METHODS BASED ON $\chi^2$ AND CONTINGENCY


Another type of treatment, which dispenses with the assumption
of normality, is based on Pearson’s $\chi^2$ test for Goodness of Fit. The
obtained values of s in any matching experiment can be compared
with the following chance distribution, and $\chi^2$ calculated from

$$\chi^2 = \sum \frac{n(f_o - f_e)^2}{100 f_e}.$$

      s              f_e, Per Cent
      0 ................. 36.79
      1 ................. 36.79
      2 ................. 18.39
      3 .................  6.13
      4 or more .........  1.90

The probability can then be found from $\chi^2$ Tables
where Pearson’s n’ = 5. Applied to the data of our experiment with
balls (n = 10), $\chi^2$ = 4.873 and P = .302. The flaw in this method
probably lies in the fact that most of $\chi^2$ is made up of deviations of
s = 4 or more from 1.9 per cent; and since the number of these bigger
deviations is unreliable, $\chi^2$ and P are likely to be inaccurate. If
s = 3 and s = 4 or more are grouped together, the method becomes
too coarse.
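A short sketch of this test follows. It assumes that the chance percentages are the limiting values $100e^{-1}/s!$ with everything from s = 4 upward pooled, and that n’ = 5 corresponds to four degrees of freedom, for which the tail probability has the closed form $e^{-x/2}(1 + x/2)$. The observed counts in the usage lines are invented purely for illustration.

```python
from math import exp, factorial


def chance_percentages() -> list[float]:
    """Chance percentages for s = 0, 1, 2, 3 and '4 or more'."""
    first_four = [100 * exp(-1) / factorial(s) for s in range(4)]
    return first_four + [100 - sum(first_four)]


def chi_square(observed_counts: list[int]) -> float:
    """Goodness of fit of observed counts (s = 0, 1, 2, 3, 4+) to chance."""
    n = sum(observed_counts)
    expected = [n * pct / 100 for pct in chance_percentages()]
    return sum((o - e) ** 2 / e for o, e in zip(observed_counts, expected))


def p_value_4df(x: float) -> float:
    """P(chi-square >= x) for four degrees of freedom (Pearson's n' = 5)."""
    return exp(-x / 2) * (1 + x / 2)


if __name__ == "__main__":
    print([round(p, 2) for p in chance_percentages()])
    hypothetical_counts = [2, 3, 3, 1, 1]   # invented scores of ten Judges
    x = chi_square(hypothetical_counts)
    print(round(x, 3), round(p_value_4df(x), 3))
```

With so few Judges the expected count in the last class is only a fraction of a unit, which makes the instability of $\chi^2$ noted above easy to see.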

Zubin advocates the mean square contingency coefficient for
expressing the validity of matching.¹⁰ In its ordinary form, however,
it is unsuitable since the size of C depends upon the deviations from
the independence value (or chance frequency) in all the cells and not
only upon the correct deviations. The following Table from Allport
and Vernon² illustrates this objection.


TABLE I. MATCHING THUMBNAIL PERSONALITY SKETCHES WITH HANDWRITING RECORDS

Handwriting            Personality sketches
records            D         A         C         B
    1           (14.0)      5.5       0.0       5.5
    2             3.1      (5.5)     16.4       0.0
    3             5.5      14.0      (4.7)      0.8
    4             2.4       0.0       3.9     (18.7)

















The percentages of correct matchings are here given along the diagonal, 
in parentheses, and it will be seen that they are sometimes exceeded 
by percentages in incorrect cells. Thus C, which is 0.69 out of a 
maximum possible 0.866, represents the consistency of the Judges’ 
matchings, not their correctness. 
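The ordinary coefficient for Table I can be recomputed from the sixteen percentages. In this sketch the last entry of the first row is read as 5.5, which is the value that makes every row and column total 25 per cent; the function names are illustrative.

```python
from math import sqrt

# Percentages of all judgments; rows = handwriting records 1-4,
# columns = sketches D, A, C, B (correct matchings on the diagonal).
TABLE_I = [
    [14.0, 5.5, 0.0, 5.5],
    [3.1, 5.5, 16.4, 0.0],
    [5.5, 14.0, 4.7, 0.8],
    [2.4, 0.0, 3.9, 18.7],
]


def contingency_coefficient(table: list[list[float]]) -> float:
    """Ordinary mean square contingency coefficient C = sqrt(phi^2 / (1 + phi^2))."""
    total = sum(sum(row) for row in table)
    row_p = [sum(row) / total for row in table]
    col_p = [sum(col) / total for col in zip(*table)]
    phi2 = 0.0
    for i, row in enumerate(table):
        for j, cell in enumerate(row):
            expected = row_p[i] * col_p[j]          # independence value
            phi2 += (cell / total - expected) ** 2 / expected
    return sqrt(phi2 / (1 + phi2))


def max_coefficient(t: int) -> float:
    """Largest value C can reach in a t x t table."""
    return sqrt((t - 1) / t)


if __name__ == "__main__":
    print(round(contingency_coefficient(TABLE_I), 2), round(max_coefficient(4), 3))
```

Because every cell enters the sum, the large off-diagonal entries such as 16.4 and 14.0 raise C just as much as correct matchings would.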

This difficulty can be eliminated by the following modification. 
Let the deviations in all the correct cells be averaged (yielding 10.75 
per cent in the above Table), and also the deviations in the incorrect 
cells (yielding 4.75 per cent). Any large frequencies in the incorrect, 
or small frequencies in the correct, cells will now reduce and not 
increase the size of $\chi^2$ and C. For the data in Table I, C = 0.38 out
of 0.866 by this method. The average frequency in correct cells is
NS/t, and in the incorrect cells N(1 − S)/[t(t − 1)]; the independence
value is N/t². Thus

$$\chi^2 = \frac{N(St - 1)^2}{t - 1} = \frac{N(\bar{s} - 1)^2}{t - 1}.$$





The probability of any such value of $\chi^2$ can be determined from a
Table which has been published by Yule.* If we apply this method
to the data of the experiment with balls, $\chi^2$ = 4.763 and P = .029.
Thus the method seems to yield (in this and in other concrete examples 
worked out by the present writer) probabilities of the same order as 
Chapman’s accurate figures. 
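The figures for the ball experiment follow directly from the modified formula. In this sketch the probability for one degree of freedom is obtained from the complementary error function rather than from Yule’s Table, which is a substitution made for convenience; S is taken as 1.63/6.

```python
from math import erfc, sqrt


def matching_chi_square(S: float, t: int, N: int) -> float:
    """Modified chi-square: N (St - 1)^2 / (t - 1)."""
    return N * (S * t - 1) ** 2 / (t - 1)


def p_one_df(chi2: float) -> float:
    """P(chi-square >= chi2) for one degree of freedom."""
    return erfc(sqrt(chi2 / 2))


if __name__ == "__main__":
    chi2 = matching_chi_square(S=1.63 / 6, t=6, N=60)   # ball experiment, n = 10
    print(round(chi2, 3), round(p_one_df(chi2), 3))
```

When S equals the chance value 1/t the expression vanishes, and it grows with the square of the excess of $\bar{s}$ over 1.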

The corresponding contingency coefficients may be calculated from 
the formula: 


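$$C = \sqrt{\frac{\chi^2}{N + \chi^2}} = \sqrt{\frac{\phi^2}{1 + \phi^2}} = \frac{St - 1}{\sqrt{(t - 1) + (St - 1)^2}} = \frac{\bar{s} - 1}{\sqrt{(t - 1) + (\bar{s} - 1)^2}}.$$

For the same experimental data, C = 0.271 out of a maximum possible
0.913.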
The determination of the SE, i.e. the reliability, of such a
contingency coefficient is not easy. By substituting in Pearson’s formula
we find:†

$$\sigma_C = \frac{\sqrt{\phi^2 + \psi^3 - \phi^4}}{\sqrt{N}\,\phi\,(1 + \phi^2)^{3/2}} = \frac{(t - 1)\sqrt{(St - 1)[(t - 1)^2 + 1] + t[(t - 1) - (St - 1)^2]}}{\sqrt{Nt}\,[(t - 1) + (St - 1)^2]^{3/2}}.$$
This formula, however, is based on certain assumptions which may 
not be legitimate owing to our modification of the mean square
contingency method. In particular it assumes that terms of the order
1/N², 1/N³, etc. are negligible, and that the obtained matching
contingency table is a random sample of N judgments from a population
with specified proportions in the different categories. The latter
condition is infringed because the judgments are not all independent;
the final judgment in an ordinary t:t matching experiment is usually
reached by a process of elimination. (We will return to this point
below.) However, in spite of these uncertainties, the following
evidence from the experiments with running balls does seem to prove
the adequacy of the formula for $\sigma_C$.
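For routine use the coefficient and its SE can be computed together. The sketch below assumes the modified C and $\sigma_C$ in the forms $(St-1)/\sqrt{(t-1)+(St-1)^2}$ and $(t-1)\sqrt{(St-1)[(t-1)^2+1]+t[(t-1)-(St-1)^2]}\,/\,\{\sqrt{Nt}\,[(t-1)+(St-1)^2]^{3/2}\}$; the function names are illustrative.

```python
from math import sqrt


def contingency(S: float, t: int) -> float:
    """Modified contingency coefficient for t:t matching."""
    d = S * t - 1                       # excess of the mean score over chance
    return d / sqrt((t - 1) + d * d)


def sigma_c(S: float, t: int, N: int) -> float:
    """Theoretical SE of the modified coefficient for N judgments."""
    d = S * t - 1
    numerator = (t - 1) * sqrt(d * ((t - 1) ** 2 + 1) + t * ((t - 1) - d * d))
    denominator = sqrt(N * t) * ((t - 1) + d * d) ** 1.5
    return numerator / denominator


if __name__ == "__main__":
    print(round(contingency(1.63 / 6, 6), 3))     # ball experiment
    print(round(contingency(1.0, 6), 3))          # every matching correct
    print(round(sigma_c(1.63 / 6, 6, 60), 4))     # t = 6, n = 10
    print(round(sigma_c(0.10, 10, 50), 4))        # chance level: 1/sqrt(N)
```

At the chance level (S = 1/t) the expression reduces to $1/\sqrt{N}$ whatever the value of t.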





























* Yule’s Table,® not Pearson’s, must be used because n’ = 2 for any value of 
t, there being only one degree of freedom in our modified frequency distribution. 
† $\bar{s}$ may be inserted instead of St throughout, if preferred.













The three hundred values of $\bar{s}$ for t = 6 and n = 10 were expressed
as contingency coefficients whose SE was found to be .1410. The
theoretical SE according to our formula (when N = 6 × 10) was
.1396. Similarly when grouped into one hundred fifty sets of twenty
runs the empirical and theoretical SE’s were .0987 and .0974. In
other experiments seven hundred twenty runs with the six balls and
seven hundred twenty more with four balls were grouped into successive
sets of five runs; in the former the empirical and theoretical
SE’s were .1928 and .1952, and in the latter .2234 and .2236. Even in
Chapman’s experiment with cards (t = 10, n = 5, $\bar{s}$ = 1.0, C = 0.0)
the SE of his distribution in terms of contingency coefficients was
found to be approximately .142,* and the theoretical $\sigma_C$ according to
the formula is $1/\sqrt{N}$ = .1414. In all these cases the discrepancies
between theoretical and empirical values are statistically insignificant.

The objection to our formula for the SE, based on the lack of 
independence of judgments, might be met by making N = n(t − 1)
instead of nt, and by omitting from consideration each Judge’s final 
match. Thus a Judge who matched six photographs with six sketches 
would be regarded as having made only five judgments, there being no 
option about his final judgment. This procedure was also tried out 
with the three hundred sets of ten runs of the six balls, the numbers of 
times that the sixth ball arrived in the sixth position being omitted. 
S per cent was 26.01, giving C = 0.2432 and the empirical SE of the 
three hundred coefficients was .1607. The theoretical SE, when 
N = n(t — 1) = 50, was .1539. The agreement here is not quite so 
good; the difference between the SE’s is .0068. But since the SE of 
the SE is $.1607/\sqrt{2 \times 300}$ = .0066, the discrepancy is insignificant.
We may conclude from this result, and from the results quoted in 
Section V, that our formula for the SE is applicable whether or not the 
judgments are all independent. 

A further objection to this method is that we are again assuming 
normality of distribution of values of C. But, as pointed out above, 
this assumption is not wholly unjustifiable when $\bar{s}$ is distinctly greater
than 1.0, i.e. when (as in most matching experiments) C lies between
about 0.20 and 0.60. Thus the value of $C/\sigma_C$ gives a useful measure
of the reliability of a matching result; the probabilities obtained by
reference to Kelley-Wood Tables are approximately of the same order
as those given by Chapman, and as the figures obtained by the $\chi^2$





* In calculating this figure, and in other calculations where $\bar{s}$ was less than 1.0,
fictitious negative contingencies had to be adopted. E.g., C for $\bar{s}$ = 0.6 was taken
as equal to C for $\bar{s}$ = 1.4, but with a minus sign. Cf. also footnote on page 4.

method. For instance, when $C/\sigma_C$ = 0.271/.1396 = 1.941, P = .026,
as compared with Chapman’s .041. A more important point, which
has been established by computing a number of specimen C’s and $\sigma_C$’s,
is that data which yield the same probability in different parts of
Chapman’s Table also yield approximately the same relative probability
by the $C/\sigma_C$ method.
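The conversion of $C/\sigma_C$ into a probability is a single normal-curve lookup. The sketch assumes a one-tailed area, computed here from the complementary error function instead of the Kelley-Wood Tables.

```python
from math import erfc, sqrt


def probability_of_ratio(c: float, sigma_c: float) -> float:
    """One-tailed normal probability of a coefficient as large as c by chance."""
    z = c / sigma_c
    return 0.5 * erfc(z / sqrt(2))


if __name__ == "__main__":
    print(round(0.271 / 0.1396, 3))                      # the critical ratio
    print(round(probability_of_ratio(0.271, 0.1396), 3))
```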

This point may be illustrated by another experiment where four 
series of seven hundred twenty runs each were carried out using 
respectively six, four, three and two balls at a time. In the series 
with two balls, each of the fifteen possible combinations of two out 
of six (1.2; 1.3; 1.4; ... etc. ... 5.6) was run forty-eight times in
random order; similarly each possible combination of three or four
out of six was run an equivalent number of times. Thus the material
was completely comparable throughout the experiment, only the value
of t was changed. The results appear in Table II. Clearly the probability
of the four results is practically identical; the differences are not
statistically significant. This agreement, which is a crucial test of 
our method, might be less close for very high values of S owing to 


TABLE II. STATISTICAL EXPERIMENT ILLUSTRATING THE MATCHING OF THE
SAME MATERIAL WITH VARIOUS VALUES OF t

 t    S per cent      C      SE × √N      N        SE      C/SE
 6      28.33       0.298     1.069      4320    .01625    18.3
 4      39.70       0.317     0.996      2880    .01861    17.0
 3      51.25       0.358     0.933      2160    .02013    17.8
 2      72.08       0.407     0.851      1440    .02244    18.1























discrepancies between the maximum values of C and 1.0. But moder- 
ate values of S, similar to those involved in these experiments, are much 
more frequently met in actual practice, so that this limitation is proba- 
bly not very important. Supposing, therefore, that we wish to deter- 
mine the correlation of photographs or handwriting, etc. with 
personality sketches by the contingency method, we may rest assured 
that the value of t which we choose, although it will affect the size of C,
will not affect the probability of our result, so long as the number of 
judges and other conditions are kept constant.* Although a 2:2 





* Though it is not intended to include in this article the consideration of psycho- 
logical factors, yet one “other condition’’ of vital importance must be mentioned, 
namely the likelihood that there exists an optimum value of t for any given type of
material. In some experiments of the present writer on matching photographs
with vocations it was found that when t = 3 the task was probably too easy to call

matching experiment is actually a far easier task than a 6:6 experiment,
yet this factor is compensated for in the above Table by the
greater number of judgments in the latter than in the former. Had
N been the same in both these experiments, the value of $C/\sigma_C$ for 2:2
matching would have been $\sqrt{3}$ times greater than for 6:6 matching.
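The last two columns of Table II, and the statement about equal N, can be checked from the tabled entries alone. In this sketch the rows are copied from Table II; the equal-N comparison uses the column SE × √N, so that the number of judgments cancels out.

```python
from math import sqrt

# Rows of Table II: (t, C, SE x sqrt(N), N, SE)
TABLE_II = [
    (6, 0.298, 1.069, 4320, 0.01625),
    (4, 0.317, 0.996, 2880, 0.01861),
    (3, 0.358, 0.933, 2160, 0.02013),
    (2, 0.407, 0.851, 1440, 0.02244),
]


def critical_ratios() -> dict[int, float]:
    """C/SE for each value of t, from the tabled C and SE."""
    return {t: c / se for t, c, _, _, se in TABLE_II}


def equal_n_advantage(t_small: int = 2, t_large: int = 6) -> float:
    """Ratio of C/SE for the smaller t to that for the larger t when N is equal."""
    per_judgment = {t: c / k for t, c, k, _, _ in TABLE_II}   # C / (SE x sqrt(N))
    return per_judgment[t_small] / per_judgment[t_large]


if __name__ == "__main__":
    print({t: round(r, 1) for t, r in critical_ratios().items()})
    print(round(equal_n_advantage(), 3), round(sqrt(3), 3))
```

The equal-N ratio computed from the tabled figures comes out a little below $\sqrt{3}$, the value that follows from N being three times as large in the 6:6 series as in the 2:2 series.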

Summing up the various methods of evaluating matching results, 
we may conclude that Chapman’s Table of probabilities should be 
employed when $\bar{s}$ and n are small, and when the investigator merely
wishes to determine whether or not a result is significantly superior
to chance probability. For most problems the method by which $\sigma_{\bar{s}}$
is determined from the obtained distribution of values of s would seem
to be the simplest and most legitimate; it only involves the one
assumption, namely that the chance distribution of values of $\bar{s}$ does not
deviate greatly from normality. But since psychologists are accustomed to
measure degrees of relationship in terms of correlation coefficients and 
their PE’s, they will probably prefer the contingency method. For 
though it involves more dubious assumptions than the other methods 
it seems to work well in practice, and subsequent Sections of this 
article will show that it is capable of extension to a variety of more 
complex matching problems. 


V. THE MATCHING OF UNEQUAL NUMBERS OF ELEMENTS 


In a great many of the published matching experiments, the num- 
bers of elements in the two series have been unequal. There is a 
decided practical advantage in such a procedure, namely that it pre- 
vents the Judge from arriving at his final matching by means of a 
process of elimination. For example, if four photographs are matched 
with six or eight, instead of with four character sketches (the extra 
two or four sketches being unrepresented among the photographs) 
a much more genuine process of judgment is demanded from the matchers.
Moreover one of the main objections to our formula for $\sigma_C$
disappears. In an experiment where t' elements are to be matched
with t others, which we shall call t':t matching, the same formulae
for C and $\sigma_C$ still apply, but some caution is necessary in deciding on
the value of N. In most cases N = nt' instead of nt. A series of





forth the Judges’ maximum ability; so that the result was distinctly less successful 
than when the same material was matched with t = 6. On the other hand if t is
too large the Judges may be unable to cope with it and they will again achieve a
smaller success. The incidence of this subjective factor does not, however,
undermine the statistical equivalency of matching with different values of t, as
demonstrated above.

hypothetical examples are given below in order to make the matter
clear; most of them have been confirmed by data from the statistical
experiments with running balls.

1a. Four character sketches are matched with four photographs
by n judges. This is an ordinary 4:4 experiment where N = 4n, as
has been shown by the first ball experiment described above.

1b. One character sketch is matched with four photographs. This
is a 1:4 experiment; t = 4, t' = 1 and N = n.

2a. Six extracts from the writings of three authors, grouped into
two separate sets of three each, are matched with photographs of the
authors. This is a 2 × 3:3 experiment; t = 3, t' = 3, m = 2, N = 6n.

2b. Six extracts from the writings of six authors are matched with
photographs of three of the authors, and the Judges are told that only
half the extracts will be required. This is a 3:6 experiment; t = 6,
t' = 3, N = 3n.

This latter example was duplicated by the running balls. In
one hundred fifty sets of ten runs each of the six balls, only rank
positions one, two and three at the bottom of the plane were considered,
and the numbers of times that balls one, two and three arrived in
these positions were recorded. The average value of S was 26.34
per cent, corresponding to C = 0.251 and to a theoretical SE of .1987.
The empirical SE of the one hundred fifty coefficients was .2080, a
value which differs from the expected value by less than its own SE.

3a. Fifty handwriting specimens, arranged in twenty-five pairs 
each containing one male and one female specimen, are sorted according
to sex. This is a 25 × 2:2 experiment; t = 2, m = 25, N = 50n.

3b. Fifty handwriting specimens, shuffled together, are sorted 
according to sex. If the Judges know that the two sexes are equally 
represented, so that their piles should contain twenty-five each, we 
should call it a 50:2 experiment. But it is more likely, with so large 
a number of specimens, that they will judge each one on its own merits; 
and if (as in most of the published experiments) they do not know the 
proportions of the sexes, they will have to consider each specimen 
separately. This is therefore a 50 × 1:2 experiment; t = 2, t' = 1,
m = 50 and N = 50n. 
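In every one of these examples the total number of judgments equals the number of Judges, times the number of separate sets m, times the number of elements t' to be placed in each set. That product rule is an inference from the six cases and is not stated as a general formula in the text; the sketch below merely tabulates it, with illustrative names.

```python
def total_judgments(n: int, t_prime: int, m: int = 1) -> int:
    """N for n Judges, m separate sets, and t' elements to be placed per set."""
    return n * m * t_prime


# (t', m) for the hypothetical examples described above
DESIGNS = {
    "1a (4:4)": (4, 1),
    "1b (1:4)": (1, 1),
    "2a (2 x 3:3)": (3, 2),
    "2b (3:6)": (3, 1),
    "3a (25 x 2:2)": (2, 25),
    "3b (50 x 1:2)": (1, 50),
}

if __name__ == "__main__":
    n = 10   # illustrative number of Judges
    for label, (t_prime, m) in DESIGNS.items():
        print(label, total_judgments(n, t_prime, m))
```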

In order to illustrate Nos. 3a and 3b, let us first take the result of 
the experiment described above where two balls were run at a time, 
seven hundred twenty times. S per cent was 72.08, C was 0.407 and 
$\sigma_C$ = .0224. Next if we consider a series of runs with six balls, regarding
Nos. 1, 2 and 3 as female, 4, 5 and 6 as male handwritings, and count a
correct match each time a “female ball” arrives in any of the first
three rank positions or a “male ball” in any of the last three positions,
we shall reproduce the conditions of Experiment No. 3b. In two
hundred forty runs with six balls the percentage of such successful
matchings was 71.67, and since N is 1440 in both experiments the
result is practically identical with that just quoted; C = 0.398,
$\sigma_C$ = .0227.

It should be pointed out that, although Experiments 3a and 3b 
give results of the same degree of probability by our method, yet 
the sorting of pairs of handwritings according to sex is obviously a 
much easier task than the sorting of shuffled specimens. The reason 
why 3a does not in this instance lead to a greater success than 3b is 
that, in the former each judge does twenty-five matchings, but in the 
latter he has to do fifty matchings and it will probably take him quite 
twice as long to accomplish. Had he spent about the same amount 
of time and effort on each experiment, he could have matched fifty 
pairs in 3a and fifty single specimens in 3b and would then have achieved 
a coefficient with a probable error $\sqrt{2}$ times smaller in 3a than in 3b.

There is still one more variety of matching where two (or more) 
elements in one series are equivalent, and are both matched with one 
element in the other series. We shall call this (t'):t matching, where t'
is the number of equivalent elements. For example: 

2c. Six extracts from the writings of three authors are matched with 
three photographs; the extracts are mixed up, but the judges know that 
each author is represented by two extracts. We can designate this 
as a 3 × (2):6 experiment.

In this instance, the original formula will still apply, since we can
treat it as 2 × 3:3 matching, where t = 3 instead of six, and N = 6n.
In a series of fifteen hundred runs with the six balls, a record was taken
of the proportion of times that balls one and two reached the bottom
either first or second, three and four either third or fourth, five and six
either fifth or last. S per cent was 51.23; i.e. its value was practically
identical with that obtained when the balls were run three at a time. 
Grouped into one hundred fifty sets of ten runs, the empirical SE was 
.1162; C = .355 and the theoretical SE = .1204. 

In other experiments of this type, however, a modified formula is 
required; for example: 

3c. Ten handwriting specimens are presented, three of which are 
female and seven male. The judges have to pick out the female ones. 
This may be called a (3):10 experiment, where t = 10, t' = 3 and
N = 3n. S is here the proportion of ‘female’ judgments that are
correct, not the proportion of total judgments.

On working out the contingency coefficient and SE as before, the
following formulae are found:

$$C = \sqrt{\frac{(St - t')^2}{t'(t - t') + (St - t')^2}}$$

$$\sigma_C = \frac{(t - t')\sqrt{t'}\,\sqrt{(St - t')[t'^2 + (t - t')^2] + t[t'(t - t') - (St - t')^2]}}{\sqrt{Nt}\,[t'(t - t') + (St - t')^2]^{3/2}}$$
To obtain confirmation, five balls were run one thousand times; the 
proportion of times that balls one and two reached the bottom either 
first or second was 59.2 per cent. Grouped into two hundred sets of 
five runs each the empirical SE of the contingencies calculated by this 
new formula was .2004. C for the whole series was 0.365, and the
theoretical $\sigma_C$ (when N = 2 × 5) was .2028. Similar results were
obtained with (2):6 and (3):7 experiments.*
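The figures for the five-ball series can be recomputed from the (t'):t formulae. The sketch assumes C in the form $(St-t')/\sqrt{t'(t-t')+(St-t')^2}$ and $\sigma_C$ in the form $(t-t')\sqrt{t'}\sqrt{(St-t')[t'^2+(t-t')^2]+t[t'(t-t')-(St-t')^2]}\,/\,\{\sqrt{Nt}\,[t'(t-t')+(St-t')^2]^{3/2}\}$; the function names are illustrative.

```python
from math import sqrt


def contingency_equiv(S: float, t: int, t_prime: int) -> float:
    """C for (t'):t matching, S being the proportion of correct picks."""
    d = S * t - t_prime
    return d / sqrt(t_prime * (t - t_prime) + d * d)


def sigma_c_equiv(S: float, t: int, t_prime: int, N: int) -> float:
    """Theoretical SE of C for (t'):t matching with N judgments."""
    d = S * t - t_prime
    u = t - t_prime
    inner = d * (t_prime ** 2 + u ** 2) + t * (t_prime * u - d * d)
    numerator = u * sqrt(t_prime) * sqrt(inner)
    denominator = sqrt(N * t) * (t_prime * u + d * d) ** 1.5
    return numerator / denominator


if __name__ == "__main__":
    # Five balls, balls one and two finishing first or second: (2):5 matching
    print(round(contingency_equiv(0.592, 5, 2), 3))
    print(round(sigma_c_equiv(0.592, 5, 2, N=10), 4))
```

With t' = 1 both expressions reduce to the formulae for ordinary t':t matching.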



















VI. SECOND OR FURTHER CHOICES IN MATCHING 


Some Judges, when dissatisfied with their original matchings, often 
ask to be allowed to make a second or further choice for certain of the 
elements. We should therefore try to derive a formula for expressing 
the validity of such additional choices. It is essential not to insist 
on a second choice for every element, since then a Judge’s correct first 
choices would necessarily yield incorrect second choices. Suppose 
then that in an ordinary t:t experiment the Judges give an average of
$t_2$ second choices, $nt_2$ in all. Each of these is matched against t − 1
other elements (not against t elements since the first choice is already
eliminated from consideration). Similarly $nt_3$ third choices may be
made, each being matched against t − 2 other elements. Let $S_2$
and $S_3$ be the proportions of correct second and third choices. The
resulting formulae for C and $\sigma_C$ are so complicated that we will merely
present here the formulae for $\chi^2$ and for the cell $\psi^3$ function (i.e. $N\psi^3$ or
$\sum (f - f_e)^3/f_e^2$), from which C and $\sigma_C$ may be calculated.
$\chi^2$ based on first choices only = $\frac{nt}{t - 1}(St - 1)^2$. To obtain $\phi^2$
divide by N = nt.

$\chi^2$ based on first and second choices =
$\frac{nt}{t - 1}(St - 1)^2 + \frac{nt_2}{t - 2}[S_2(t - 1) - 1]^2$.
To obtain $\phi^2$ divide by $N + N_2 = n(t + t_2)$.





* The formulae do not apply when t' is greater than t/2. For instance the
identification of seven male handwritings out of ten male and female specimens
should be treated as (3):10, not as (7):10 matching. If t' = ½t the matching
reduces to 1:2, as in Experiment 3b. Note also that the SE of (2):6 matching is
not the same as that of 1:3 matching, although 3 × (2):6 is the same as 3:3.
$\chi^2$ based on first, second and third choices
$= \frac{nt}{t-1}(St - 1)^2 + \frac{nt_2}{t-2}[S_2(t - 1) - 1]^2 + \frac{nt_3}{t-3}[S_3(t - 2) - 1]^2$.
To obtain $\phi^2$ divide by $N + N_2 + N_3 = n(t + t_2 + t_3)$. If further choices
are included the formulae for $\chi^2$ can be extended in accordance with the
above pattern.


Cell $\psi^2$ function based on first choices only
$= n\left[1 + \frac{1}{(t-1)^2}\right](St - 1)^3$. If second choices are included,
$\frac{nt_2}{t-1}\left[1 + \frac{1}{(t-2)^2}\right][S_2(t - 1) - 1]^3$ should be added to
the previous function. If third choices are included,
$\frac{nt_3}{t-2}\left[1 + \frac{1}{(t-3)^2}\right][S_3(t - 2) - 1]^3$ should be
added. The series can be similarly extended if more choices are
included. To obtain $\psi^2$, the summed functions are divided by the
total value of N as before. C and $\sigma_c$ are then obtained by substituting
for $\phi^2$ and $\psi^2$ in the standard formulae, quoted on p. 7.

It should be noted that unless $S_2$ is considerably larger than S,
or $S_3$ than $S_2$, the inclusion of these additional choices usually reduces
the size of C. But since $\chi^2$ is always increased so long as $S_2 > 1/(t - 1)$
and $S_3 > 1/(t - 2)$, the ultimate probability of the result ($C/\sigma_c$) should
also be somewhat increased.

The present writer has not, so far, been able to devise a simple
statistical experiment for trying out the above formulae. The following
is a hypothetical illustration. One hundred Judges match 10
photographs and sketches, obtaining thirty-five per cent correct;
fifty of them give an average of six second choices each, of which thirty
per cent are correct; thirty of them give an average of four third choices
each, of which twenty-five per cent are correct. Thus t = 10, $t_2$ = 3,
$t_3$ = 1.2; N = 900, $N_2$ = 300, $N_3$ = 120. $\chi^2$ from first choices
only = 625, from second choices only = 108.4, from third choices =
17.1. The corresponding values of cell $\psi^2$ function are 1423.6, 166.3
and 15.3.

C from first choices = 0.640, $\sigma_c$ = .0243, $C/\sigma_c$ = 26.3.

C from first and second choices = 0.615, $\sigma_c$ = .0226, $C/\sigma_c$ = 27.3.

C from first, second and third choices = 0.602, $\sigma_c$ = .0222, $C/\sigma_c$ =
27.1.

This indicates that the inclusion of additional choices makes very
little difference to the final result unless $t_2$ and $S_2$ etc. are large. It
would also seem to lessen the weight of a possible objection, namely
that a Judge can get all the elements right if he makes sufficient additional
choices.
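The hypothetical illustration of this Section can be retraced with a short sketch. It assumes that each set of choices contributes N_k/(m − 1) · (S_k m − 1)² to χ² and N_k/m · [1 + 1/(m − 1)²] · (S_k m − 1)³ to the cell ψ function, where m is the number of elements still open to choice (t, t − 1, t − 2); this is a reading of the printed formulae, not a quotation of them. N, N₂ and N₃ are taken exactly as printed, and the function names are hypothetical.

```python
import math


def chi2_term(n_choices, s, alternatives):
    """Chi-square contribution of n_choices judgments, each made among
    `alternatives` elements, of which a proportion s is correct."""
    return n_choices / (alternatives - 1) * (s * alternatives - 1) ** 2


def psi_term(n_choices, s, alternatives):
    """Contribution of the same judgments to the cell psi function."""
    return (n_choices / alternatives
            * (1 + 1 / (alternatives - 1) ** 2)
            * (s * alternatives - 1) ** 3)


def contingency(chi2_sum, n_sum):
    """C = sqrt(phi^2 / (1 + phi^2)), with phi^2 = chi^2 / N."""
    phi2 = chi2_sum / n_sum
    return math.sqrt(phi2 / (1 + phi2))


t = 10
# (N_k, S_k, number of alternatives) for first, second and third choices,
# with N, N_2 and N_3 taken as printed in the illustration.
choices = [(900, 0.35, t), (300, 0.30, t - 1), (120, 0.25, t - 2)]

chi2_parts = [chi2_term(*c) for c in choices]
psi_parts = [psi_term(*c) for c in choices]

# C from first choices, then first + second, then all three together.
c_values = [
    contingency(sum(chi2_parts[:k]), sum(c[0] for c in choices[:k]))
    for k in (1, 2, 3)
]

print([round(x, 1) for x in chi2_parts])
print([round(x, 1) for x in psi_parts])
print([round(c, 3) for c in c_values])
```

The sketch stops at C; the standard error would require the general formula quoted earlier in the paper, which is not restated here.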


VII. OTHER DEVELOPMENTS OF THE METHOD 


It remains to discuss certain other methodological possibilities of 
the contingency matching technique. 

Since we have verified the adequacy of the SE formula by means 
of the experiments with balls, it may also be legitimate to compare 
the contingency coefficients derived from two different experiments 
by the ordinary $SE_{\text{diff}}$ method. For example, if two groups of Judges
match the same material and obtain coefficients $C_1$ and $C_2$, then the
significance of the difference between their abilities (being uncorrelated)
may be derived from the usual formula,
$\sigma_{\text{diff}} = \sqrt{\sigma_{C_1}^2 + \sigma_{C_2}^2}$. Similarly
if a group of Judges matches, say, character sketches first with
the voice and secondly with the features it should be possible to show 
whether voice or features are—to a significant extent—the more 
revealing; and so on. 

As a rough test of this supposition the two series of seven hundred
twenty runs of six balls and of four balls were taken, and the differences
between the C’s in successive sets of ten runs were plotted as a frequency
distribution; the empirical SE of the difference was .2220.
Now the average result in the two series (cf. Table III) was C = 0.298,
$\sigma_c$ = .1377 (when N = 60), and C = 0.317, $\sigma_c$ = .1575 (when N = 40).
Hence by the above formula $\sigma_{\text{diff}}$ = .2091. As the SE of the obtained
SE is .0185, the agreement is good.
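The arithmetic of this rough test may be sketched as follows. Two points are assumptions rather than statements of the text: that seven hundred twenty runs in sets of ten give seventy-two differences, and that the SE of an empirical SE is taken by the usual normal-theory approximation s/√(2k). The function names are hypothetical.

```python
import math


def se_difference(se_1, se_2):
    """SE of the difference between two uncorrelated coefficients."""
    return math.sqrt(se_1 ** 2 + se_2 ** 2)


def se_of_se(se, k):
    """Approximate SE of a standard deviation estimated from k values
    (normal-theory approximation, assumed here)."""
    return se / math.sqrt(2 * k)


theoretical = se_difference(0.1377, 0.1575)  # sigma_c values from the text
empirical = 0.2220                           # empirical SE from the text
n_sets = 720 // 10                           # assumed number of differences
tolerance = se_of_se(empirical, n_sets)

print(round(theoretical, 4), round(tolerance, 4))
print(abs(empirical - theoretical) < tolerance)
```

On this reading the empirical and theoretical values differ by less than one SE, which is the sense in which the agreement is called good.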

Presumably contingency coefficients from several experiments can 
be summed or averaged, as is often done with correlation coefficients; 
but in general the combination of matching results can be more legitimately
effected by determining $\chi^2$ and $N\psi^2$ for each result separately
and then summing or averaging these quantities.* An example of this
procedure was given in the preceding Section on second choices in
matching. Future investigators may, perhaps, be able to work out
the application of factor analysis techniques to contingency coefficients;
for example it might be interesting to factorize (a) the various
modes of expression of a group of Subjects (i.e. persons matched);
(b) the various matching abilities of a group of Judges.

The technique embodies a promising instrument for psychological
research which is quite outside the scope of correlation techniques,
namely the study of individual cases. If, for example, the features
of five individuals, A, B, . . . are matched with sketches of their
personalities, separate coefficients may be calculated for each individual
from the proportion of correct matchings which he receives. A’s
features may so be found to be more expressive than B’s, and so on.

* Since Pearson’s n’ is always two in all matching experiments.

An advantage of matching with unequal numbers of elements is
that the difficulty of the experiment can be very readily graded.
For instance, if three detailed personality sketches are available, the
matching of these with only three photographs may be too easy a
task for the average Judge. In that case we merely have to add
some more photographs, say six more, of individuals whose sketches
are not given, in order to convert the 3:3 matching into the much
more difficult task of 3:9 matching. If a dozen sketches and photographs
(or other sets of material) are available, it may be better to
match four of the sketches with eight of the photos, and the other
four photos with the other eight sketches (2 × 4:8), than to match
them in equal subgroups (e.g. 2 × 6:6 or 3 × 4:4), or all at once
(12:12).

The comparability of contingency and correlation coefficients is 
a matter of some importance. For the general level of correlations 
has become a widely accepted convention (e.g. > 0.80 = high, 0.80 
to 0.60 = moderate, etc.). Hence we must ask whether contingencies 
may be similarly interpreted (apart from their reduced maxima when 
the number of cells is small), or whether, possibly, the contingency 
technique greatly exaggerates the size of coefficients in the same 
way that coefficients of association or colligation are exaggerated. 
Again there are investigations where it would be useful to be able 
to determine whether more successful judgments of personality can 
be attained by means of the matching technique than by means of 
rating and ranking. Can we then show that the C which is derived
from a given value of S is not higher than the corresponding r, if it
were possible to calculate r, nor the $\sigma_c$ lower than the corresponding $\sigma_r$?
Although a correlation coefficient cannot in general be directly 
expressed as a percentage figure which represents the degree of overlap 
between the two series which are correlated, yet Hull® has shown that 
the reverse of the coefficient of alienation, $1 - \sqrt{1 - r^2}$, does represent
the degree of reduction of error in forecasting one series from the
other. Hence he calls $1 - \sqrt{1 - r^2}$ the “forecasting efficiency” of
an r.* Somewhat analogously the excess of successful matchings over
chance $(S - 1/t)$ may be said to represent the forecasting efficiency of a
C. If this rather dubious comparison be accepted we find that so
long as t = 9 or less, then the C which corresponds to a certain value
of $S - 1/t$ is invariably lower than the r which corresponds to an equivalent
value of $1 - \sqrt{1 - r^2}$.* Thus in some ninety per cent of the
published matching experiments the results expressed as contingency
coefficients would be somewhat smaller than correlation coefficients
of the same so-called forecasting efficiency.

* Cf. Douglass’s discussion of the limitations of this efficiency index.‘
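The comparison between C and r at equal forecasting efficiency can be explored numerically. The sketch below assumes that C follows from S through φ² = (St − 1)²/(t − 1), as for first choices in t:t matching, and that r is recovered from an efficiency E by inverting 1 − √(1 − r²) = E; the function names and the grid of S values are hypothetical.

```python
import math


def c_from_s(s, t):
    """Contingency coefficient for t:t matching with proportion s correct."""
    phi2 = (s * t - 1) ** 2 / (t - 1)
    return math.sqrt(phi2 / (1 + phi2))


def r_from_efficiency(e):
    """The r whose forecasting efficiency 1 - sqrt(1 - r^2) equals e."""
    return math.sqrt(1 - (1 - e) ** 2)


def c_exceeds_r(t, steps=200):
    """True if C rises above the matched r for some S between chance and 1."""
    for i in range(1, steps + 1):
        s = 1 / t + (1 - 1 / t) * i / steps
        if c_from_s(s, t) > r_from_efficiency(s - 1 / t) + 1e-9:
            return True
    return False


for t in (2, 3, 5, 9, 12):
    print(t, c_exceeds_r(t))
```

On this grid C stays below r for every t up to nine, while for twelve elements some middle values of S give a C above the matched r.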

Again, if the SE of a C could be calculated by the formula for r,
namely $(1 - C^2)/\sqrt{N}$, the values so obtained would in about seventy per
cent of the published experiments be smaller than the values obtained
by the proper formula for $\sigma_c$.† We may conclude then that the contingency
technique for evaluating matching probably yields results
which are of the same order as, but in general a little less favorable 
than, the results which might be expected if correlation techniques 
could be applied to the same matching data. 
* When t = 10 or more, some of the middle values of C are higher than r’s.
But when t = 2 or 3, all values of C tend to be distinctly lower, since their maxima
are .707 and .816, i.e. $\sqrt{(t-1)/t}$, instead of 1.0.


2 _( 
+ Whenever S < ae 2 then 1-c is smaller than a-. 
Although there appears to be an impression that the Porteus Maze
test might be used as a substitute for the Binet, Porteus seems never
to have claimed that his test was more than a supplement to the Binet.
Recently (1933) he has defended this stand by an appeal to his own
early writings. This appeal seems entirely justified in view of the
fact that at no point can he be found stating that his Maze test is to
replace the Binet, whereas he does say repeatedly that it serves as a
useful supplement to the Binet.


In his earliest article Porteus (1915) points out some current objections
to the Binet test and says in regard to differences between Binet
and Porteus scores, “Where there is a wider difference between the two
estimates it is not contended that the Binet is incorrect, but simply
that the Motor Intelligence series tests in other directions.” Relative
to the need Porteus says, “Such a series, if correctly graded, would
prove a somewhat valuable supplement to, and partial corrective of,
the Binet-Simon scale. The Motor-Intellectual series of tests (Original
Porteus Mazes) which is described in this article, is the outcome of
an attempt to supply this want.”

Two years later Porteus states his claims as, “It is not claimed
that they (Porteus Mazes) enable us to arrive at the general mental
age of the subject, though in the majority of cases there is a close
correlation between results by these tests, and by the Binet-Simon.
. . . Mental age per the Porteus tests means that, in the capacities of
foresight, prudence, resistance to suggestion, and sustaining the
attention, the child has reached the average development of the age
assigned to the tests passed under the given conditions.” (1917,
p. 22.)


In 1918 Porteus was still relating his Mazes to the Binet test as
supplementary and insisted, “By careful observation . . . facts
regarding children’s dispositions become apparent through the use
of the tests. These facts are not revealed by the Binet examination,
and have an important bearing on the evaluation of intelligence.
It is claimed that the tests form a necessary supplement to the Binet
scale.” “The relation between this series and the Binet tests is for
large groups of children, fairly constant; about seventy per cent test
within one year of their Binet age.” (1918, p. 31.)

1 Publication of the Indiana University Psychological Clinics. Ser. II, No. 9.

Porteus quotes himself on some of these early studies in his defence, 
concluding, ‘‘ All of these preliminary studies indicated a certain use- 
fulness for the new test in the field of mental examinations. With 
defectives at the lower levels of ability where diagnosis was least 
difficult there was a close agreement in most cases with the Binet. 
The most marked disagreement was found, however, among the higher 
grade cases, and provided the validity of the Maze could be sufficiently 
established, it would be extremely valuable as a diagnostic aid in those 
cases where diagnosis is most difficult.”” (1933, p. 84.) 

As a measure of validity Porteus considers the comparative Porteus
Maze score and Binet Test score in regard to delinquency, and decides,
“This indicates that for children of defective levels of intelligence the 
Maze has a closer relation to social sufficiency than the Binet.” 
(1933, p. 99.) 

As a parting statement of his claims Porteus gives the following, 
“All that the Maze test does is to measure approximately the ability 
of the individual to use planning capacity in a task at or about moron 
levels, and to readapt his methods in the face of increasing difficulties 
as far as his temperament will allow him to do so.” (1933, p. 167.) 

Other writers, however, have not been as careful as Porteus was, 
and have considered the Maze test a substitute for the Binet. In his 
summary of the Porteus Maze test, Burt (1922) concludes, ‘‘ The maze- 
tests, therefore, supplement, though they cannot, I think, supplant, 
the other scales (chiefly the Binet) in a profitable way.” While this 
statement is, in fact, entirely in agreement with Porteus’ own claims 
it would appear that Burt unintentionally misled his British and 
American readers by stating his own opinion as if it were in disagree- 
ment with other authorities—presumably the author of the test 
included. 

Worthington (1926, p. 217) states the purpose of his study, “to
determine whether these tests (a battery of performance tests including
the Porteus Mazes) contribute any additional information to the
general intelligence rating measured by the Stanford-Binet scale,
and to determine the best substitutes to be used when the Stanford-Binet
or other intelligence tests cannot be applied,” in such a way as
to imply that the performance tests considered have been offered as
substitutes for the Binet test.

Accepting the Binet test as a measure of general intelligence, 
Worthington chooses to evaluate other tests in the following manner, 
“If a test proves to have a low correlation with the Stanford-Binet,
it is evident that the test does not measure general intelligence;
if the correlation is high, it may be concluded that the test measures
at least some factors of general intelligence, and is, therefore, valuable
as a substitute for the Stanford-Binet.” (1926, p. 220.) 

With reference to the second highest correlation yet reported 
between the Porteus Mazes and the Binet test, Worthington thus 
evaluates substitute possibilities: “‘It has been said that the Porteus 
Maze is a measure of social adaptability, considering this trait inde- 
pendently from intelligence, but the high correlation of 0.75 with the 
Stanford-Binet indicates that if the test is really a measure of social 
adaptability, then this trait is a function of general intelligence.” 
(1926, p. 220.) 

This high correlation, by the way, is calculated without consider- 
ing the chronological age of the subjects as is indicated by the state- 
ment with which Worthington’s method is explained. ‘‘The study 
was made by means of correlating the performance test mental ages 
with the Stanford-Binet mental age, and studying the distribution of 
mental ages.” 

The present study is not the first to consider the correlation 
between Porteus Maze scores and Binet test results. Porteus in 1918 
correlated the scores for two hundred feebleminded children and 
one hundred ninety normal school-children on the Goddard revision 
of the Binet test with their scores on the Porteus Maze. He also 
correlated the scores of two hundred sixty-three normal school- 
children on the Terman revision with their Porteus Maze scores. The 
correlations for these three groups were respectively .70, .69, and .77— 
this last the highest correlation yet reported. Berry and Porteus 
(1920) found correlations for school children in Melbourne, Australia, 
for boys and girls, and for each age-group from six to thirteen sepa- 
rately. These correlations ranged from .24 to .75 between the 
Stanford-Binet and the Porteus Maze for groups of twenty-eight to 
seventy-seven cases each. Morgenthau (1922) with a group of one
hundred ten cases reported a correlation of .54. Gaw (1925)
working with boys and girls separately and within a given age-group
(av. 13.5) reported correlations between the Stanford-Binet
IQ’s and Porteus Maze scores of .52 for boys and .29 for girls. Worthington
(1926) working with one hundred eighteen cases from the
Illinois Institute of Juvenile Research, and fifty additional unselected 
cases reported correlations between the Binet and Porteus of .75 for 
the Juvenile cases and .59 for the unselected. These studies will be 
presented in tabular form presently. 

The present study was made on one hundred boys and one hundred 
girls between the ages of five and fourteen from the Indiana University 
Psychological Clinics at Bloomington and Indianapolis. These cases 
were all given both the Stanford-Binet and the Porteus Maze as part 
of their regular clinical examination. 


TABLE I.—AVERAGE TEST SCORES

                          Boys     Girls
Number of cases            100       100
Average CA                 9-4       9-9
Average Binet MA           8-1      8-11
Average Binet IQ            86        91
Average Porteus MA         9-0      8-11
Average Porteus MQ          96        91








The data in Table I show the average scores for these two hundred
cases. It will be noticed that the girls are slightly older than the boys.
In terms of Binet IQ the girls are superior, while in terms of the Porteus
MQ the boys are superior. The boys’ average Porteus MQ is noticeably
superior to their own average Binet IQ. The relationship
between the performances on these two tests is shown by the correlations
given in Table II.

TABLE II.—CORRELATIONS BETWEEN BINET AND PORTEUS SCORES

                                                 Boys          Girls
Between                                         r     PE      r     PE
CA and Binet MA                               .38   .058    .52   .049
Binet MA and Porteus MA                       .61   .042    .73   .031
CA and Porteus MA                             .42   .056    .36   .055
Binet MA and Porteus MA with CA constant      .54           .68

Attention should be called to the differences
in correlation for the boys and girls. Thus the relation between CA
and Binet performance is higher for the girls, but the reverse is true
between CA and the Porteus performance. Also the correlation
between Binet and Porteus performance is higher for girls. When
these correlations are corrected by partialing out the effect of CA the
boys’ correlation is lowered by .07 while the girls’ is lowered by .05.
Similar sex differences in the correlations have been noted by other
investigators.
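The correction for chronological age can be retraced with the ordinary first-order partial correlation formula. The sketch below is an illustration only: it assumes that this standard formula was the one used, and it reads the boys' correlation between CA and Binet MA as .38. The function name is hypothetical.

```python
import math


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# x = Binet MA, y = Porteus MA, z = CA (coefficients as read from Table II).
boys = partial_r(0.61, 0.38, 0.42)
girls = partial_r(0.73, 0.52, 0.36)

print(round(boys, 2), round(girls, 2))
```

The two results can be compared with the last row of Table II and with the reductions of .07 and .05 mentioned in the text.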

Table III presents a summary of correlations between the Porteus
Maze and the Binet tests reported by various investigators. Correlations
in the first part of the table are based on groups with a narrow
CA range, while those in the second part have a wide CA range.

TABLE III.—CORRELATIONS BETWEEN PORTEUS MAZE AND BINET AGES

Narrow CA range

Subjects                                     No.     r     Reference
School children, boys, by age group           43    .42    Berry and Porteus, 1920.
                                              63    .24
                                              76    .39
                                              63    .56
                                              49    .61
                                              60    .60
                                              66    .55
                                              56    .39
School children, girls, by age group          28    .48
                                              63    .57
                                              76    .41
                                              77    .75
                                              70    .61
                                              50    .56
                                              61    .46
                                              42    .63
Children about to leave school,                            Gaw, 1925.
average age 13.5
  Boys                                        52    .52
  Girls                                       48    .29
I.U. Psychological Clinic cases,                           Present study.
5-14 yrs. Corrected for C.A.
  Boys                                       100    .54
  Girls                                      100    .68

Wide CA range

Subjects                                     No.     r     Reference
(A) F.m. CA not . . .                        200    .70    Porteus, 1918.
(B) School children                          190    .69
(C) School children                          263    .77
School children, 7-16 yrs.                   110    .54    Morgenthau, 1922.
(A) Cases Ill. Bur. Juv. Res. 5-16 yrs.      118    .75    Worthington, 1926.
(B) Unselected children                       50    .59
I.U. Psychological Clinic cases,                           Present study.
5-14 yrs.
  Boys                                       100    .61
  Girls                                      100    .73

It will be seen that the range of correlations for groups having a narrow
age-band is .24 to .75 while the range for groups having a wide age-band
is .54 to .77. The median correlation for the narrow age-band
group is approximately .54, while the median correlation for the
wide age-band is approximately .69. Although correlations for
different sized groups are not amenable to averaging, the means of
these groups are of some interest as they rather closely check the magnitude
and directional trend of the medians. These averages were .51
for the narrow age-band group and .67 for the wide age-band group.
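The medians and means quoted above can be recomputed from the coefficients of Table III. The sketch below assumes the reading of the table in which twenty coefficients fall in the narrow age-band group and eight in the wide age-band group; the variable names are hypothetical.

```python
from statistics import mean, median

# Coefficients as read from Table III (grouping is an assumption of this sketch).
narrow = [
    0.42, 0.24, 0.39, 0.56, 0.61, 0.60, 0.55, 0.39,  # Berry and Porteus, boys
    0.48, 0.57, 0.41, 0.75, 0.61, 0.56, 0.46, 0.63,  # Berry and Porteus, girls
    0.52, 0.29,                                      # Gaw
    0.54, 0.68,                                      # present study, CA constant
]
wide = [0.70, 0.69, 0.77, 0.54, 0.75, 0.59, 0.61, 0.73]

narrow_median, wide_median = median(narrow), median(wide)
narrow_mean, wide_mean = mean(narrow), mean(wide)

print(len(narrow) + len(wide))
print(round(narrow_median, 3), round(wide_median, 3))
print(round(narrow_mean, 3), round(wide_mean, 3))
```

With an even number of coefficients in each group the median falls midway between the two central values, which is why the text gives the medians as approximate.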


SEX DIFFERENCES 


As we have earlier mentioned, our data indicate a higher correla- 
tion between Binet and Porteus performance for girls than for boys. 
Furthermore, the elimination of CA as a common factor affects the 
two correlations differently; the boys’ correlation decreasing by a 
slightly greater amount than the girls’. Unfortunately the data 
summarized in Table III have not all been presented with the sexes 
separately. The two studies other than our own in which they have
been separated disagree. 

Berry and Porteus (1920) present data on boys and girls separately. 
The median correlation for girls six to thirteen years of age is approximately
.56, that median coming from a range of .41 to .75. The boys
from similar age groups show a median correlation of approximately
.48 with a range of .24 to .61. Gaw (1925) working with children
about to leave school found a correlation between Binet IQ’s and 
Porteus Maze scores of .52 for boys and .29 for girls. Reasons for 
this reversal are not evident, but that there is a disagreement would 
suggest that further investigation is desirable. 

When one considers the apparent clinical importance of the Porteus 
Maze test as indicated by its standing in second place in the survey 
of tests used in clinics as reported to the Clinical Section of the Ameri- 
can Psychological Association it is surprising that more investigators 
have not studied its relation to the Binet. Yet of the twenty-eight
correlations shown in Table III nineteen have been published by
Porteus himself. Because this appears as an important problem some
discussion of the data available would seem pertinent.

The primary question with which we are concerned in this study
is “How closely does performance on the Porteus Maze duplicate
performance on the Binet?” In our attempt at least tentatively to
suggest an answer to this question we have presented a survey of
correlational data based upon our own group of two hundred cases
as well as similar data from other investigations.

These correlations may be grouped into two classes: those based 
on groups with wide age-ranges and those based on groups with narrow 
age-ranges. As we have shown, the median correlation coefficient 
for the first group is approximately .69, while the median for the 
second group is only .54. In our own original group the difference 
between the correlations with and without correction for chronological 
age is only about one-half of the difference between these medians.
In any case it is evident that one of the errors introduced in reported 
correlations between the two tests is the failure to eliminate chronologi- 
cal age. One investigator (Worthington), at least, on the basis of 
such a spurious correlation has intimated that these tests are measuring 
the same thing. From the very nature of the tests it is obvious that 
performance improves with increase in chronological age, therefore 
it would appear to be an immediately evident precaution to eliminate 
this influence in any studies of the relationship. 

For the sake of the present argument let us assume that the median 
coefficient of .54 is approximately the true relation between the Binet 
and Porteus tests. Such a correlation would suggest that of the factors 
basic to successful performance on these tests about thirty per cent 
are common to both. What the common factors are must for the 
present remain speculative. From the point of view of clinical usage 
we may take them to be such factors of “general intelligence” that, by
convention, the Binet is supposed to measure. Porteus has at various 
times suggested that the maze performance is affected by foresight, 
planning capacity, prudence, resistance to suggestions and the like. 
Porteus and Babcock (1926) have suggested the term ‘“ psycho- 
synergic” to include all such non-intellectual factors. Direct evidence 
of the influence of these elements on maze performance is almost 
entirely lacking. Yet this question is of sufficient importance to 
warrant experimental investigation. 
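The figure of about thirty per cent may be read as the square of the correlation coefficient. The short sketch below assumes that this is the interpretation intended; the function name is hypothetical.

```python
def common_share(r):
    """Proportion of common factors implied by r, taken here as r squared."""
    return r ** 2


narrow_share = common_share(0.54)  # median for the narrow age-band groups
wide_share = common_share(0.69)    # median for the wide age-band groups

print(round(narrow_share, 2), round(wide_share, 2))
```

On the same reading the uncorrected median of .69 would imply a considerably larger common share, which illustrates how far the failure to eliminate chronological age can inflate the apparent overlap.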

The differences between the correlations for boys and girls suggest
another problem. If further work substantiates higher correlations
for girls it would appear that in that sex “general intelligence” is less
affected by non-intellectual factors, at least in Porteus Maze performance,
than it is in boys. However, as there is disagreement in
the data of different studies the reported differences may be accidental. 

Our survey of evidence would appear strongly to corroborate 
Porteus’ own contention that the Maze test is a supplement to the 
Binet. Together the two tests give a better picture of a child’s
performance ability than either by itself. This has been shown 
experimentally by Porteus (1920) who found that the average of Binet 
and Porteus score had a higher correlation with social adaptation 
in the feebleminded than did either test alone. But this study stands 
as an isolated example. No one has yet undertaken an adequate 
study of the factors involved in Porteus Maze performance. This 
paucity of research when viewed in the light of the test’s clinical useful- 
ness would indicate a serious need of more extensive research. 
THE DEXTRALITY QUOTIENTS OF FIFTY
SIX-YEAR-OLDS WITH REGARD TO 
HAND USAGE* 


WENDELL JOHNSON AND DARLENE DUKE 


State University of Iowa 


This study represents an attempt to apply controlled observational
techniques to the problem of measuring, in terms of hand usage, the 
degree of right-handedness among six year old children. The study 
may be regarded as a preliminary attempt to evaluate the feasibility 
of determining age norms with regard to hand usage, and of scoring 
handedness measurements in terms of the dextrality quotient (DQ). 
The DQ is to be defined as the percentage of the total achievement 
involved in any test of handedness which is to be credited to the right 
hand. 

Handedness has been studied by means of methods designed to 
measure strength, speed of movement, steadiness and accuracy, pattern 
organization in movement, anatomical structure, physiological 
aspects of handedness (such as the characteristics of muscle contraction 
and nerve conduction), and the characteristics of hand usage. By 
hand usage is meant an expressed or demonstrated preference for 
using one hand rather than the other. Hand usage may be studied 
by means of questionnaire or interview methods, the subject’s state- 
ments being accepted as data. These methods have been used by a 
number of investigators. Studies by Downey,! Jasper? and Koch! 
may be cited in this connection. Hand usage may also be studied by 
means of direct observation, which is the method used in the present 
study. Most handedness studies which have involved observational 
methods in any sense have emphasized the investigation of strength, 
speed, steadiness, or pattern organization rather than the relative 
frequency of the use of either hand. Updegraff' used controlled 
observation in a study of preferential handedness in children from 
two to six years of age, although her chief aim was to determine the 
age at which hand preference asserted itself. Perhaps the previous 
study most closely related to the present investigation was that of 
Jones,* who studied dextrality as a function of age, employing a 
measure of proportion of right-handedness. 





* This study is part of a comprehensive research program at the University of 
Iowa under the direction of Professor Lee Edward Travis. 
PRELIMINARY INVESTIGATION


Procedure and Subjects.—In the preliminary phase of this study
fifty first-grade children were observed in a classroom situation. 
The only types of children excluded from investigation were those who 
presented stuttering, motor deficiencies of such a nature as to affect 
hand preference, or obviously subnormal mentality. 

The purpose of this preliminary observation was to ascertain the 
types of unimanual activities characteristic of first-grade children in 
the classroom. It was a further purpose to determine which activities 
were most representative of the children’s handedness. 

An activity was regarded as unimanual if there was relatively 
equal opportunity for its performance by either hand. Activities 
involving the use of both hands were regarded as unimanual if the two 
hands were used with unequal significance, one hand playing the 
dominant rôle, provided further that there was equal opportunity for
either hand to assume dominance.* 

Thirty-one hours and thirty-five minutes were spent in making 
observations of fifty children. The ages of these children ranged 
from five years, one month and ten days to seven years, five months 
and one day. The average age was six years and three months. 
Twenty-five were girls and twenty-five were boys. The amount of 
time spent observing any one child ranged from fifteen to one hundred 
minutes, the average being 37.9 minutes. The observations were made 
at various times of the day. Observations were made of twenty-six 
children in the University Elementary School, Iowa City, eighteen 
children in the public schools of Burlington, Iowa, and six children 
in the public schools of Birmingham, Iowa. 

One child was observed at a time. Activities were recorded in 
the order of their occurrence, and the hand performing each activity 
was recorded. A total of one thousand one hundred fifty-three 
observations were made, and these involved one hundred twenty-five 
kinds of unimanual activity. 

Reliability of Preliminary Observations.—One of us and another 
person observed simultaneously for six hours and five minutes, and 
for the commonly observed activities an agreement as to hand usage 
of 99.3 per cent was found. Another individual worked simultaneously
with one of us for fifty minutes with the result of 94.8 per cent agree- 
ment as to hand usage. 





* See Updegraff* for a discussion of criteria of unimanual activities. 
Results of Preliminary Observations.—After these observations had
been completed, the dextrality quotient (DQ) for each child was
found by means of the formula $DQ = \frac{R + B/2}{N}$, in which R and B
represent the number of operations performed by the right hand and
by both hands (neither hand predominating), respectively, and N
represents the total number of operations performed. The DQ
of each activity as performed by all the children observed performing 
it was also obtained. The DQ thus obtained for each activity and 
the average total DQ of the children performing that activity were 
compared, and those activities for which the difference ranged from 
0.00 to 0.11 were selected for the construction of a test to be described 
presently.* Writing, cutting with scissors, and turning over paper 
were observed in the preliminary study, and although the difference 
between the DQ for the activity and the DQ for the group performing 
the activity was greater than .11 in each case, they were included in 
the test because it was believed they would combine well with the other 
activities selected and would furnish a source of additional observa- 
tions. In building the test, it was found that the use of blocks, 
although not found in the classroom situations observed, gave added 
interest as well as opportunity for more manipulations. Thus, a total 
of twenty-seven activities were selected and arranged in a series 
suitable for testing purposes. 
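
The scoring rule just described may be illustrated by a short computation. What follows is only a minimal sketch in Python: the function name and the counts for the example child are hypothetical and are not taken from the study's records; only the formula itself, DQ = (R + B/2) / N, comes from the text.

```python
def dextrality_quotient(right: int, both: int, total: int) -> float:
    """Return DQ = (R + B/2) / N for one child or one activity.

    right -- operations performed by the right hand (R)
    both  -- operations performed by both hands, neither predominating (B)
    total -- total number of operations observed (N)
    """
    if total <= 0:
        raise ValueError("total number of operations must be positive")
    if right < 0 or both < 0 or right + both > total:
        raise ValueError("counts are inconsistent with the total")
    return (right + both / 2) / total


# Hypothetical child (illustrative counts only): 75 operations in all,
# 55 with the right hand, 6 with both hands, the remaining 14 with the left.
example_dq = dextrality_quotient(right=55, both=6, total=75)
print(round(example_dq, 2))
```

A child who uses the right hand throughout obtains a DQ of 1.00, and one who uses the left hand throughout obtains 0.00, so the quotient reads directly as a proportion of right-hand usage.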


MAIN STUDY: CONTROLLED OBSERVATION OF SELECTED HAND ACTIVITIES 


The Controlled Observation Test of Hand Usage——The material 
needed in giving this test consists of a box of crayons, a pencil with an 
eraser attached, scissors, five regular flash cards, a tablet of paper 
(8 X 5 inches or small enough to be easily manipulated with one hand), 
eight blocks (1¼-inch cubes or a size which will require some skill in 
handling), a desk with an open shelf beneath (such as is found in the 
ordinary classroom), and a chair for the tester placed in front of the 
desk. The particular desk used in giving all but eight of the tests in 
this study was 18¾ X 11¼ X 24 inches, and the chair was of the 
swivel type. 

* The individual DQ’s for each child and for each activity are given in the 
Appendix of the copy of this report on file in the State University of Iowa Library. 
In this Appendix are to be found, also, a list of the one hundred twenty-five 
activities observed, the number of children performing each activity, the average number 
of observations per child for each activity, and the number of left- and right-handed 
performances of each activity. 

Before the child enters the testing room, the tablet, pencil, crayons 
and scissors are placed in the desk. The following is the conversation 
carried on by the tester with the child upon entrance of the child into 
the room, the tester taking careful note of the hand used for the 
activities which are underlined and recording it on a suitable form. 

“Please sit down behind this desk and fold your hands while I 
write your name. (Note which thumb is uppermost when hands are 
folded. The hands should be folded in such a way as to interlace the 
fingers. If the child fails to fold the hands this way he may be 
prompted to do so.) Now will you please take the articles out of the 
desk one at a time and place them on the desk.” (Make one observation 
for each article taken from the desk.) 

The tester picks up the articles and places the tablet directly in 
front of the child, so that it is as easily reached with one hand as the 
other, and says, “Please tear out one sheet of paper. Then put the 
tablet in the desk. Turn the paper over and when you are through 
raise your hand.” 

The tester next places the pencil directly in front of the child 
(note the hand used in taking the pencil) and continues: “Write your 
name at the top of the paper. No, erase it and write it at the bottom 
of the sheet. When you are through raise your hand. Now will 
you please draw a picture and when you have finished put your pencil 
in the desk and raise your hand.” 

Next, a box of crayons is placed directly in front of the child 
by the tester who says: “Now color the picture.” (Note which hand 
is used in opening the box, in taking each crayon out, in coloring, in placing 
each crayon in the box, and in closing the box.) “When through, place 
the crayon box in the desk and raise your hand.” 

For the next activity the tester places the scissors directly in front 
of the child. (Note the hand used in taking the scissors.) “Cut off 
the bottom line. Then put the scissors in the desk and raise your hand. 
Now fold your hands while I put the blocks on the desk.” 

Next the tester places the eight blocks in a straight line from one 
side to the other as near the middle of the desk as possible, and then 
says: “Point to A, point to M, etc.,” (allowing an interval of one or 
more seconds to elapse between commands. Do not call off the blocks 
in order of arrangement. See that the child gets his hands in a resting 
position or folded before calling off the next block. There will be eight 
observations for this activity.) “Now, pile up the blocks one on 
top of the other, making a pile eight blocks high, being careful to pick 
up one block at a time.” (Note the hand used in picking up each 
block and in placing it. There will be seven or eight observations for 
each of these activities, depending upon whether the child moves the 
first block.) The tester again lays the blocks on the desk saying: 
“Please carry these over to another desk one at a time,” (or any other 
convenient spot. Note the hand used in picking up the blocks. There 
will be eight observations for this activity.) 

Finally, the tester places five cards across the top of the desk, 
saying to the child: “Point to the cards as I name them. Please pick 
them up one at a time, shuffle them and lay them down one at a time.” 
(There will be five observations for each activity: pointing, picking 
up cards, laying down cards.) 

This concludes the test. The tester should feel perfectly free to 
carry on additional conversation, letting the child respond spon- 
taneously to any of the situations which arise, as this will bring forth 
a more normal reaction. 

Following is a list of the activities underlined in the above description 
of the test; numbers in parentheses indicate the usual number of 
observations for each activity: 


Clasping hands................ (2)       Putting crayons in box........ (1+) 
Taking articles from desk..... (4)       Closing crayon box............ (1) 
Tearing out paper............. (1)       Putting crayon box in desk.... (1) 
Putting tablet in desk........ (1)       Picking up scissors........... (1) 
Turning over paper............ (1)       Cutting with scissors......... (1) 
Raising hand.................. (4)       Putting scissors in desk...... (1) 
Picking up pencil............. (1)       Pointing...................... (13) 
Writing....................... (2)       Picking up blocks to build.... (7 or 8) 
Erasing....................... (1)       Placing blocks for building... (7 or 8) 
Drawing....................... (1)       Picking up blocks to carry.... (8) 
Putting pencil in desk........ (1)       Picking up cards.............. (5) 
Opening crayon box............ (1)       Shuffling cards............... (1) 
Taking crayons out............ (1+)      Laying down cards............. (5) 
Coloring...................... (1) 


Procedure and Subjects—The above test was administered to fifty 
six-year-old first-grade children. Thirty-five children were tested 
in the public schools of Burlington, Iowa; six were tested in the 
public schools of Iowa City, Iowa; and nine were tested in the Soldiers’ 
Orphans’ Home at Davenport, Iowa.* As in the preliminary investigation, 
the only children excluded were those who presented stuttering, 
motor deficiencies of such a nature as to affect hand preference, or 
obvious subnormal mentality.† 

The time required to administer the test was approximately ten 
minutes per child. It is of importance that the test held a great deal 
of interest for the children, and it was not necessary to resort to coaxing 
in any sense in order to elicit responses. 

Reliability.—A check on consistency of performance was provided 
by retesting each child after an interval of from half an hour to two 
days. The average interval between tests was approximately one 
day. 

DQ’s were computed for each child for the first test (T₁) and the 
retest (T₂). The coefficient of correlation between T₁ DQ’s and T₂ 
DQ’s was .92 with a PE of ±.015. 

DQ’s were also computed for each activity as performed by all 
children for T₁ and T₂. The coefficient between these DQ’s for T₁ 
and T₂ was .94 with a PE of ±.015. 
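
The two probable errors quoted above can be checked against the conventional formula for the probable error of a correlation coefficient, PE = .6745 (1 − r²) / √N. The report does not state which formula was used, so the sketch below rests on that assumption; it also assumes N = 50 (children) for the first coefficient and N = 27 (activities) for the second. The function name is illustrative only.

```python
import math


def probable_error_of_r(r: float, n: int) -> float:
    """Conventional probable error of a correlation coefficient.

    Assumed formula (not stated in the report): 0.6745 * (1 - r^2) / sqrt(n).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)


# Test-retest correlation of the children's DQ's: r = .92, assumed N = 50.
pe_children = probable_error_of_r(0.92, 50)

# Test-retest correlation of the activity DQ's: r = .94, assumed N = 27.
pe_activities = probable_error_of_r(0.94, 27)

print(round(pe_children, 3), round(pe_activities, 3))
```

Under these assumptions both values round to .015, in agreement with the figures reported.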

A further check on reliability of the test was made as follows. 
The DQ’s of each child for T₁ and T₂ were computed, and the difference 
between them determined. Then the PE of the difference was established. A 
difference four or more times larger than its PE is to be regarded as 
significant. Of the fifty children, only one presented a difference in 
DQ between T₁ and T₂ which was more than four times its PE. This 
individual was relatively ambidextrous, as indicated by DQ’s of .71 
and .44, and insofar as ambidexterity is manifested by inconsistency 
in hand preference, this difference between T₁ and T₂ scores is not a 
serious reflection on the reliability of the test. Another child pre- 
sented a difference which was not quite four times its PE. This 
child also tended toward ambidexterity, as indicated especially by a 
T, DQ of .68. Forty-nine of the fifty children presented differences in 
DQ between T₁ and T₂ which were less than four times their respective 
PE’s. 
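
The criterion of four probable errors may be sketched as follows. The report does not say how the PE of a difference was obtained; the sketch assumes that each DQ is treated as a simple proportion based on about seventy-five operations (the average number per test mentioned in the Summary), with PE = .6745 √(pq/n), and that the PE of a difference is the square root of the sum of the squared PE's. All function names are hypothetical, and the half-credit given to bimanual operations is ignored.

```python
import math


def pe_of_proportion(p: float, n: int) -> float:
    """Probable error of a proportion p based on n operations (assumed form)."""
    return 0.6745 * math.sqrt(p * (1.0 - p) / n)


def pe_of_difference(p1: float, n1: int, p2: float, n2: int) -> float:
    """Probable error of the difference between two independent proportions."""
    return math.hypot(pe_of_proportion(p1, n1), pe_of_proportion(p2, n2))


def critical_ratio(p1: float, n1: int, p2: float, n2: int) -> float:
    """Difference expressed in units of its own probable error."""
    return abs(p1 - p2) / pe_of_difference(p1, n1, p2, n2)


# The one child reported as significantly inconsistent: DQ's of .71 and .44.
# n = 75 per test is an assumption taken from the average number of operations.
ratio = critical_ratio(0.71, 75, 0.44, 75)
is_significant = ratio >= 4.0
print(round(ratio, 1), is_significant)
```

On these assumptions the difference for this child is roughly five times its probable error, which agrees with the report's classing it as significant; the exact ratio would depend on the child's actual number of operations.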

* Testing in the Soldiers’ Orphans’ Home was carried out with the permission 
and coöperation of the Iowa Child Welfare Research Station. 
† One child with a marked articulatory defect was also rejected. 

The reliability of the test was evaluated in one other manner. 
The DQ’s of each activity for T₁ and T₂ were determined by combining 
all observations of each activity as performed by all the children. 
The differences between these DQ’s and the respective PE’s of the differences were 
determined. The greatest difference was only 2.5 times its PE. Only 
two of the twenty-seven differences were two times as large as their 
respective PE’s. It is to be concluded, therefore, that the test has a 
high degree of reliability.* 

Internal Consistency of the Test.—To determine which items within 
the test were significant in representing handedness as measured by 
the test as a whole, the twenty children with the highest DQ’s for T₁ 
were classed in Group A, and the twenty children with the lowest 
DQ’s for T₁ in Group B. The average DQ’s of the two groups for 
each activity were compared and the PE of each difference was determined. 
If the difference in DQ was four or more times the PE of the 
difference the activity was considered significantly representative of 
handedness. Out of the twenty-seven activities, nineteen were 
significantly representative. For twenty-one of the activities the 
difference was three or more times the PE of the difference, and for twenty-three 
of the activities it was two or more times the PE of the difference. Therefore, 
all except eight of the activities were significant in representing handedness 
as measured by the test as a whole. Following are those activities 
for which the PE of the difference was large with relation to the difference between 
the average DQ of Group A and that of Group B: Putting tablet in 
desk, turning over paper, raising hand, picking up pencil, putting 
pencil in desk, putting crayon box in desk, cutting with scissors, putting 
scissors in desk. Of the eight less representative activities, four 
were ones involving the placing of articles in the desk. 

It is to be considered that this is a relatively severe test of internal 
consistency in view of the rather highly concentrated distribution 
of the small number of scores. This type of check is made on the 
assumption that the average total DQ of the lowest forty per cent 
of cases should be markedly lower than the average total DQ of the 
highest forty per cent. As a matter of fact, the average total DQ of 
Group A is .89 and of Group B is .55. The extent to which the activities, 
considered separately, discriminate between two small groups 
of subjects, differing on the average no more than these do, is indicative 
of a rather high degree of internal consistency of the test. To have 
made Groups A and B more extreme would have rendered them too 
small, but a fair test of internal consistency would require their 
being more extreme, say the upper and lower deciles. 





* The detailed data upon which these two paragraphs are based and also the 
data concerning the internal consistency of the test, are to be found in the Appendix 
of the copy of this report on file in the Library of the State University of Iowa. 

Distribution of DQ’s, First Test.—Distribution of the T₁ DQ’s is 
graphically presented in Fig. 1. 
The decile DQ’s are as follows: 


DECILE       DQ              DECILE       DQ 

0 .......... .03             60 ......... .82 
10 ......... .44             70 ......... .85 
20 ......... .60             80 ......... .89 
30 ......... .67             90 ......... .93 
40 ......... [illegible]     100 ........ .99 
50 ......... .78 


The median, as indicated above, is .78. The interquartile range 
extends from .64 to .87. The figures given here may be regarded as 
tentative norms for normal six-year old children. A larger number of 
cases is needed to establish satisfactorily conclusive norms. 

Inspection of Fig. 1 indicates little justification for describing the 
distribution as bimodal. About all that can reasonably be said is 
that the distribution is somewhat skewed toward the upper end of the 
range. 


SUMMARY 


This study represents an attempt to apply controlled observational 
techniques to the problem of measuring, in terms of hand usage, the 
degree of right handedness among six-year old children. The study 
may be regarded as a preliminary attempt to evaluate the feasibility 
of determining age norms with regard to hand usage and of scoring 
handedness measurements in terms of the dextrality quotient (DQ). 
The DQ is to be defined as the percentage of the total achievement 
involved in any test of handedness which is to be credited to the right 
hand. 

For this investigation fifty first-grade children were observed in 
a classroom situation for the purpose of determining the types of 
unimanual activities which are characteristic of first-grade children 
and which are most representative of their handedness. Mainly 
on the basis of these observations, twenty-seven activities were selected 
and arranged in a series suitable for testing purposes. 

Each child performed these activities in a standardized sequence 
under standardized conditions, and the hand used to perform each 
definite operation involved in each activity was noted. The test was 
scored in terms of the DQ, dextrality quotient, which is the percentage 
of the operations performed with the right hand. An average of 
approximately seventy-five operations was involved in the performance 
of the twenty-seven activities. This test was administered to fifty 
six-year old first-grade children. As a check on the consistency of 
performance the test was given a second time to the same children. 
The coefficient of correlation between the DQ’s of each child for 
the first test and the retest was .92 ± .015, and between the DQ’s 
for each activity as performed by all the children for T₁ and T₂ was 
.94 ± .015. Differences between DQ’s for T₁ and T₂ for each child 
in performing all activities and for each activity as performed by all 
children were computed and the PE of each difference was determined. 
Forty-nine of fifty children did not present statistically significant 
differences in DQ from T₁ to T₂. For no activity was there a statistically 
significant difference in DQ from T₁ to T₂. Thus, the reliability 
of the test may be regarded as high. 

Figure 1. Distribution of Dextrality Quotients 

As a measure of internal consistency of the test, there was com- 
puted the average DQ of each activity as performed by the forty per 
cent of the subjects scoring the highest DQ’s for the test as a whole, 
and as performed by the forty per cent scoring the lowest DQ’s. The 
difference between these averages for each activity and the PE of 
the difference were determined. For all but eight of the activities 
statistically significant differences were found. For all but four 
activities the difference was more than twice as large as the PE of the 
difference. These findings may be regarded as indicative of a high 
degree of internal consistency of the test, particularly in view of the 
necessarily small size of the two compared groups and the relatively 
small difference between these groups with regard to degree of right 
handedness as measured by the test as a whole. The average DQ 
for the test as a whole of the highest forty per cent, Group A, was .89, 
and for the lowest forty per cent, Group B, it was .55. 

The DQ’s for the fifty children on T₁ ranged from .03 to .99, the 
median being .78 and the interquartile range extending from .64 to .87. 
As shown in Fig. 1, forty-five of the DQ’s ranged from .50 to .99; 
the other five were as follows: .03, .14, .40, .44 and .44. There would 
seem to be almost no justification for describing the distribution as 
bimodal; rather, it would appear merely to be somewhat skewed toward 
the upper end of the range. 


CONCLUSIONS 


In this study it has been demonstrated that the hand usage of 
six-year old children in the performance of common unimanual activi- 
ties can be measured with a very satisfactory degree of reliability 
in terms of the percentage of the total number of operations which 
are performed with the right hand. Because of its high reliability, 
the test constructed for this investigation makes possible the deter- 
mination of age norms, at least at the six-year level, with regard to 
the dextrality quotient (the percentage of right handedness) as to 
hand usage observed under standardized conditions. 

The important contributions of this study are with reference to 
the concepts of the dextrality quotient and of age norms with regard to 
handedness. 

The dextrality quotient (DQ), as has been stated above, may be 
regarded as the percentage of the total achievement involved in any 
test of handedness which is to be credited to the right hand. It is 
for all practical purposes justifiable to say that the DQ represents 
percentage of right handedness, the Travis formula by which it is 
derived being a percentage formula. For example, according to the 
present study and in terms of the test here used, the median six-year 
old child is seventy-eight per cent right handed as far as actual hand 
usage is concerned. 

It is reasonable to claim, therefore, that the DQ may be used as a 
universal scoring unit for tests of handedness. As such, it makes 
possible a significant correction of the chaos involved in the present 
heterogeneity of scoring units and scoring systems applied to measures 
of handedness: It has been for all practical purposes impossible to 
correlate one handedness test with another, or to establish significant 
handedness test norms. These correlations and norms should be 
readily possible in terms of the DQ. Moreover, there would appear 
to be good reason for saying that the DQ is a more precise and in 
many respects more valid measuring unit than any others previously 
advocated. 

36 The Journal of Educational Psychology 

The correlation of one test with another, on the basis of a standard 
scoring unit, should make possible a decidedly more adequate under- 
standing of the nature of handedness, the elements of which it is 
composed, and the interrelationships of these elements in any given 
individual or in terms of group tendencies. 

Age norms in terms of DQ and with reference to standardized 
tests and batteries of tests should throw a great deal of light on the 
genetic aspects of handedness and on the significance of various 
constitutional and environmental factors affecting handedness. 

The greater preciseness of measurement which is very probably 
possible with the DQ should lead to the construction of improved tests 
of handedness, and consequently to a more adequate statement of 
the relationship between handedness and other physical and mental 
conditions. 

A suggestive finding from the present study is to be seen in the 
nature of the distribution of the obtained DQ’s (Fig. 1). This distribution 
is somewhat skewed toward the upper end of the range, but 
it would be stretching a point quite unreasonably to describe the 
distribution as bimodal. It will be of great value and interest to 
determine whether similar types of distribution of DQ’s are to be 
obtained from the use of other kinds of handedness tests, applied to 
the same and other age levels. Refutation of the commonly held 
assumption that handedness is distributed in a distinctly bimodal 
fashion would have far-reaching theoretical and practical implications. 


SOME POINTS OF MATHEMATICAL TECHNIQUE IN 
THE FACTORIAL ANALYSIS OF ABILITY 


GODFREY H. THOMSON 
Moray House, University of Edinburgh 


I. INTRODUCTORY 


The present stage in the practice of the factorial analysis of human 
abilities is to proceed somewhat as follows: A number of tests, a dozen 
or twenty, is given to a larger number of people, one hundred or more, 
and the intercorrelations of the tests are calculated. The tests are 
then by mathematical means analyzed into uncorrelated or orthogonal 
factors in such a way that the first factor explains as much as possible of 
the correlations, the second factor as much more as possible, and so on. 
The process is stopped when the experimenter considers that the 
correlations given by the factors are sufficiently like those actually 
observed, in view of the sampling errors. It is usually found that 
from three to six factors are sufficient for this.² In addition to these 
few general factors there remain specific factors, one to each test, 
which do not contribute to the correlations. The factors which are 
arrived at by this mathematical process do not however satisfy the 
psychologist, who fails to identify them with any psychological entities, 
and who dislikes the numerous negative “loadings” with which in 
many cases they are fitted. He therefore proceeds to “rotate the 
axes” which geometrically correspond to the general factors among 
them, leaving the specifics however unchanged, until the general 
factors do correspond to psychological entities which he can recognize, 
or until as many loadings as possible are zero, and none negative. 
The second of these alternative procedures is clearly more objective and 
therefore more scientific than the former. But in fact, in the few 
experiments so far reported, it is claimed that the satisfaction of 
either of these conditions leads also to the satisfaction of the other. 

¹ It is also possible to exchange the rôles of the tests and the people, and to 
calculate correlations between people instead of between tests, as I mentioned in a 
paper in the British Journal of Psychology, July 1935, Vol. XXVI, No. 1, p. 75, 
equation (17). Such correlations between persons have not infrequently been 
calculated in comparing examiners, as in a paper by myself and Miss Bailes in 
The Forum (now the British Journal of Educational Psychology) in 1926. I learn 
privately from Dr. William Stephenson, and see from a note of his in Nature of 
August 24, 1935, that the idea of correlations between persons has also occurred 
independently to him, and that he proposes to apply factorial analysis to 
experimental data of this nature. His results will undoubtedly be of great interest. 

² There is a danger here, since the fewer the people who are tested, the sooner 
one can say that the factorially produced correlations are “sufficiently” like the 
observed, for the latter will have large probable errors. 

The factors which are thus arrived at can be estimated separately 
in any individual; and also any occupation can be analyzed into the 
factors which are needed for success therein. Vocational guidance 
then consists in sending an individual possessing the factors in certain 
amounts and relative proportions into that occupation which requires 
them in those same amounts and proportions, as nearly as possible. 

The present paper will examine some of the mathematical details 
of this procedure, will give formulae for the best weighting of a team 
of tests to measure a given factor, and will give arithmetical models of 
the calculations. It will arrive at the conclusion that, mathematically, 
factors are often unnecessary middlemen, and the cause of considerable 
inaccuracy; but it will mention also a counterbalancing advantage 
to be set off against this. Sections II, III, and V may be ignored 
by non-mathematical readers. 


II. THE INDETERMINACY OF FACTORIAL ANALYSES¹ 


The fact that in a hierarchical set of tests the general factor g 
includes an indeterminate portion i which cannot be measured by a 
hierarchical team was pointed out by E. B. Wilson and is now well 
known. The reason is, fundamentally, that in a hierarchical set there 
is one more unknown than the number of equations; for each test has 
its own specific, and there is in addition the general factor g. 

It is clear that the factors will also be indeterminate in any more 
complicated factorial analysis if it leaves a specific in each test, for 
always in that case there will be more unknowns than there are equa- 
tions. As in the simpler hierarchical case, the indeterminacy will 
become less as the number of tests is increased, if no new factors are 
introduced except new specifics. 

This indeterminacy can be expressed by saying that transforma- 
tions can be found, giving the present factors as linear functions of 
other and different factors, yet leaving the factorial equations of the 
tests unchanged. 

Suppose for example that four tests $z_1$, $z_2$, $z_3$ and $z_4$ have been 
analysed into three general factors g, v, and F and specifics, as follows:² 

$$\begin{aligned}
z_1 &= .66g + .52v + .21F + .50s_1 \\
z_2 &= .52g + .66v + .54s_2 \\
z_3 &= .74g + .63s_3 \\
z_4 &= .37g + .71F + .60s_4
\end{aligned} \tag{1}$$

If now in these equations we make the following substitutions, 

$$\begin{aligned}
g &= .90\gamma - .10\nu - .10\varphi + .28\sigma_1 + .22\sigma_2 + .12\sigma_3 + .18\sigma_4 \\
v &= -.10\gamma + .90\nu - .10\varphi + .28\sigma_1 + .22\sigma_2 + .12\sigma_3 + .18\sigma_4 \\
F &= -.10\gamma - .10\nu + .90\varphi + .28\sigma_1 + .22\sigma_2 + .12\sigma_3 + .18\sigma_4 \\
s_1 &= .28\gamma + .28\nu + .28\varphi + .23\sigma_1 - .60\sigma_2 - .32\sigma_3 - .50\sigma_4 \\
s_2 &= .22\gamma + .22\nu + .22\varphi - .60\sigma_1 + .53\sigma_2 - .25\sigma_3 - .39\sigma_4 \\
s_3 &= .12\gamma + .12\nu + .12\varphi - .32\sigma_1 - .25\sigma_2 + .86\sigma_3 - .21\sigma_4 \\
s_4 &= .18\gamma + .18\nu + .18\varphi - .50\sigma_1 - .39\sigma_2 - .21\sigma_3 + .68\sigma_4
\end{aligned} \tag{2}$$

we obtain (exactly, if we use more decimal places than are here printed) 
the equations 

$$\begin{aligned}
z_1 &= .66\gamma + .52\nu + .21\varphi + .50\sigma_1 \\
z_2 &= .52\gamma + .66\nu + .54\sigma_2 \\
z_3 &= .74\gamma + .63\sigma_3 \\
z_4 &= .37\gamma + .71\varphi + .60\sigma_4
\end{aligned} \tag{3}$$

These are identical in shape with equations (1) and of course give 
identical correlations. The Roman letter factors and the Greek letter 
factors are not the same (vide equations 2) yet are exchangeable; and 
innumerable other sets can be made. 

¹ Section II may be ignored by non-mathematical readers. 
² All variables are standardized unless otherwise specified in the context. 

In a former paper¹ I gave an orthogonal matrix which produced 
such transformations of the hierarchical system (which has only one 
general factor) namely: 

$$B = I - \frac{2qq'}{q'q} \tag{4}$$

wherein I is the unit matrix (Kronecker’s delta, or the idemfactor) and 
q is the column vector 

$$\{q_0 \;\; q_1 \;\; q_2 \;\; q_3 \ldots\}, \tag{5}$$

wherein $q_0 = -1$ and $q_1, q_2, q_3, \ldots$ are calculated from the first, 
second, third . . . tests by the formula 

$$q_i = \frac{\text{loading of single general factor}}{\text{loading of specific factor}} \tag{6}$$







After the $q_i$’s corresponding to the tests have been calculated, the 
remaining $q_i$’s (to infinity) can be given any values whatever, hence 
the freedom with which different transformations can be produced. 
The quantity q’ is the transposed of q, that is, a row vector; q’q is 
therefore the sum of the squares of all the $q_i$’s, and qq’ is a square 
matrix. 

¹ “The Definition and Measurement of g.” Journal of Educational Psychology, 
April, 1935. 

This scheme can be made to fit the case of several generals by 
replacing the single $q_0$ of the above scheme by as many q’s as there 
are generals, and by taking, for the values of $q_1, q_2, q_3, \ldots$ the values 

$$q_i = \frac{\text{sum of the general loadings in test } i}{\text{specific loading in test } i} \tag{7}$$

for as many values of i as there are tests, the remaining $q_i$’s being as 
before arbitrary. The equations (2) were made in this way, taking all 
the arbitrary higher $q_i$’s to be zero. More violent changes can be 
made by taking one or more of the arbitrary higher $q_i$’s as unreal, in 
the sense of containing the factor $\sqrt{-1}$, which procedure bears some 
similarity to, and has some connection with, Heywood’s case¹ of the 
variate, in a hierarchical set, whose “correlation” with g exceeds unity, 
and whose specific has an unreal coefficient. 
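
The construction of the substitutions (2) can be checked numerically. The following is a minimal sketch in plain Python, not the author's own working: it assumes that each of the three general factors receives a leading entry of −1 in q, that the arbitrary higher q's are zero, and that the loadings are exactly the two-decimal values printed in equations (1). The names `build_q`, `transformation` and `matmul` are illustrative. Because every row of the loading matrix M is then orthogonal to q, the product MB equals M, which is why equations (3) have the same shape as equations (1).

```python
# Loadings of equations (1); columns are g, v, F, s1, s2, s3, s4.
M = [
    [0.66, 0.52, 0.21, 0.50, 0.00, 0.00, 0.00],
    [0.52, 0.66, 0.00, 0.00, 0.54, 0.00, 0.00],
    [0.74, 0.00, 0.00, 0.00, 0.00, 0.63, 0.00],
    [0.37, 0.00, 0.71, 0.00, 0.00, 0.00, 0.60],
]
N_GENERALS = 3


def build_q(loadings, n_generals):
    """One entry of -1 per general (assumed), then formula (7) for each test."""
    q = [-1.0] * n_generals
    for i, row in enumerate(loadings):
        q.append(sum(row[:n_generals]) / row[n_generals + i])
    return q


def transformation(q):
    """B = I - 2qq'/q'q, equation (4), with the arbitrary higher q's zero."""
    qq = sum(x * x for x in q)
    n = len(q)
    return [[(1.0 if i == j else 0.0) - 2.0 * q[i] * q[j] / qq
             for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


q = build_q(M, N_GENERALS)
B = transformation(q)
MB = matmul(M, B)

# Largest change in any loading after substituting the Greek-letter factors.
max_change = max(abs(MB[i][j] - M[i][j])
                 for i in range(len(M)) for j in range(len(M[0])))

# Largest departure of B B' from the unit matrix (orthogonality check).
BBt = matmul(B, [list(col) for col in zip(*B)])
max_orth_error = max(abs(BBt[i][j] - (1.0 if i == j else 0.0))
                     for i in range(len(B)) for j in range(len(B)))

print("first row of B:", [round(x, 2) for x in B[0]])
print("max change in loadings:", max_change)
print("max orthogonality error:", max_orth_error)
```

The rows of B computed in this way agree with the printed coefficients of (2) to within about .01; small discrepancies in the last place are to be expected, since the printed loadings are themselves rounded.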

The more tests there are in the series the larger becomes the fixed 
part of q’q, and the more closely does the matrix B resemble the unit 
matrix I which makes no change in the factors. That is, the more 
tests, the less indeterminacy in the factors. 

In any case, a “best” measure of the factors is required, by suitably 
“loading” or “weighting” the tests which contain them. 







¹ Proceedings of the Royal Society, London, A, Vol. CXXXIV, 1931, p. 498. 
² Section III may be ignored by non-mathematical readers. 


III. THE BEST WEIGHTINGS OF A TEAM OF TESTS TO MEASURE 
A MAN’S FACTORS² 


Let the column vector z represent the tests 

$$\{z_1 \;\; z_2 \;\; z_3 \ldots\} \tag{8}$$

where each $z_i$ represents scores in test i standardized to unity over the 
population of persons tested. A second subscript indicating the 
person can be understood but need not be printed in our present 
considerations. 

Similarly let the column vector f represent the factors 

$$\{f_1 \;\; f_2 \;\; f_3 \ldots\} \tag{9}$$

including both generals and specifics. Their number, when the 
specifics are included, will usually be larger than the number of tests. 
Then 


z= Mf (10) 


where M is the oblong matrix of weights or loadings, with as many 
rows as there are tests in z, and as many columns as there are factors 
in f. If the number of factors had been the same as the number of 
tests the matrix M would have been square, and the solution of 
equation (10) would have been simply 


$$M^{-1}z = f. \tag{11}$$


When M is oblong however we have to proceed otherwise. We have, 
if M’ is the transposed of M, 


MM’ = R, (12) 
the square matrix of correlations. Also 


$$RR^{-1} = I \tag{13}$$
$$MM'R^{-1} = I \tag{14}$$


Premultiply the members of (10) by the members of (14). We thus 
obtain 


$$MM'R^{-1}z = IMf = Mf,$$


and dropping the premultiplier M on both sides (although it is not 
square) we have 


M'R-'z = (estimated) f, (15) 


a matrix equation! which can be shown to give the best loadings of 
the tests z to measure the factors f. If M is square, so that M— exists, 
the equation (15) reduces, as it ought, to equation (11); for in that 


case we have 


f = M'R-z = M'(MM’')—"z = M'M"'M—"z = M-z 


In the hierarchical case, for example, equation (15) gives the Spearman 





1 Wherein the members of f are not standard variables in general. 
weights. In a case of four tests with three generals and four specifics,
the matrices written out at length are as follows. 


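$$
\begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{bmatrix}
=
\begin{bmatrix}
l_1 & m_1 & n_1 & k_1 & \cdot & \cdot & \cdot \\
l_2 & m_2 & n_2 & \cdot & k_2 & \cdot & \cdot \\
l_3 & m_3 & n_3 & \cdot & \cdot & k_3 & \cdot \\
l_4 & m_4 & n_4 & \cdot & \cdot & \cdot & k_4
\end{bmatrix}
\begin{bmatrix} g \\ v \\ F \\ s_1 \\ s_2 \\ s_3 \\ s_4 \end{bmatrix}
\qquad (10')
$$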








where I have named the factors f by letters actually applied by experi- 
menters to well-known generals, followed by specifics s numbered to 
agree with their tests. 

Equation (15) is then 


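$$
\text{(Estimated)}
\begin{bmatrix} g \\ v \\ F \\ s_1 \\ s_2 \\ s_3 \\ s_4 \end{bmatrix}
=
\frac{1}{|R|}
\begin{bmatrix}
l_1 & l_2 & l_3 & l_4 \\
m_1 & m_2 & m_3 & m_4 \\
n_1 & n_2 & n_3 & n_4 \\
k_1 & \cdot & \cdot & \cdot \\
\cdot & k_2 & \cdot & \cdot \\
\cdot & \cdot & k_3 & \cdot \\
\cdot & \cdot & \cdot & k_4
\end{bmatrix}
\begin{bmatrix}
R_{11} & R_{12} & R_{13} & R_{14} \\
R_{21} & R_{22} & R_{23} & R_{24} \\
R_{31} & R_{32} & R_{33} & R_{34} \\
R_{41} & R_{42} & R_{43} & R_{44}
\end{bmatrix}
\begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{bmatrix}
\qquad (15')
$$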














where $|R|$ is the determinant of the symmetrical correlation matrix

$$
R =
\begin{bmatrix}
1 & r_{21} & r_{31} & r_{41} \\
r_{12} & 1 & r_{32} & r_{42} \\
r_{13} & r_{23} & 1 & r_{43} \\
r_{14} & r_{24} & r_{34} & 1
\end{bmatrix}
\qquad (16)
$$

and $R_{11}$, $R_{12}$ etc. are the cofactors of $|R|$. Thus from (15′) we see that
the team for measuring the first general g is


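$$
\begin{aligned}
\text{Estimated } g
&= \frac{l_1R_{11} + l_2R_{21} + l_3R_{31} + l_4R_{41}}{|R|} \times z_1 \\
&+ \frac{l_1R_{12} + l_2R_{22} + l_3R_{32} + l_4R_{42}}{|R|} \times z_2 \\
&+ \frac{l_1R_{13} + l_2R_{23} + l_3R_{33} + l_4R_{43}}{|R|} \times z_3 \\
&+ \frac{l_1R_{14} + l_2R_{24} + l_3R_{34} + l_4R_{44}}{|R|} \times z_4
\end{aligned}
\qquad (17)
$$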


(Estimated g is not in standard measure, the missing variance being 
the indeterminate part.) 





Equation (15′) similarly yields expressions for estimated v and
for estimated F, as well as for the specifics. It is worth while writing
out one of the latter because of a point of curious interest. Take $s_1$.


$$\text{Estimated } s_1 = \frac{k_1}{|R|}\,(R_{11}z_1 + R_{12}z_2 + R_{13}z_3 + R_{14}z_4) \qquad (18)$$


the factor-loading $k_1$ being a multiplier of all the test loadings. We
could therefore have told the best relative loadings of the tests to
measure the specific of the first test, viz. $R_{11}$, $R_{12}$, $R_{13}$ and $R_{14}$, before
ever making the analysis, and without knowing how many generals
we were to use, or whether there existed any $s_1$ at all (in the case of $s_1$
being non-existent $k_1$ would of course be zero). This remarkable
fact about specifics has already been published by E. B. Wilson¹ who
arrived at it by considerations of vectorial geometry of hyperspace.
It is here seen as the result of the specific being specific, i.e., of $k_1$
being the only member, other than zeros, of its column in the matrix M.
In passing I may point out that this property of the weighting for 
specifics opens up the possibility of measuring each man’s contribution 
to the unreliability of a test, a matter to which I shall return in a 
later paper. 

Before the equations (15) can be used by experimental psychologists 
they must be translated into an arithmetical procedure which can 
be carried out by computers who are not necessarily mathematicians. 
The next section will show how this can be done. 


IV. AN ILLUSTRATIVE ARITHMETICAL EXAMPLE OF WEIGHTING A 
TEAM OF TESTS TO MEASURE A MAN’S FACTORS 


Let us suppose that four tests give, on a sample of persons, the 
correlations 


          1      2      3      4

   1      .     .73    .47    .45
   2     .73     .     .39    .39
   3     .47    .39     .     .27        (19)
   4     .45    .39    .27     .





and that a psychologist has, using the methods briefly outlined in 
section I, analysed these tests into three general factors g, v, and F, 





1 “On Resolution into Generals and Specifics.” Proceedings of the National
Academy of Science, Vol. XX, No. 3, March, 1934, p. 194.
leaving a specific in each test, which specific may be in part a chance
factor, in part a true specific, the loadings being as follows:— 











                          Factors
   Tests |   g     v     F     s1    s2    s3    s4
     1   |  .66   .52   .21   .50    .     .     .
     2   |  .52   .66    .     .    .54    .     .     = M   (20)
     3   |  .74    .     .     .     .    .67    .
     4   |  .37    .    .71    .     .     .    .60
These factor loadings give the following correlations:¹

      1     .69    .49    .39
     .69     1     .38    .19
     .49    .38     1     .27      = MM′ = R   (21)
     .39    .19    .27     1


and these are accepted as being sufficiently like the actual observed 
correlations (19). The differences between (21) and (19) are neglected 
as being within the limits of sampling error to be expected, and (21) is 
substituted for the observations (19). 

To use these tests to measure these three factors in any person 
we need to know the best loadings of the tests in teams for g, v, and 
F respectively. Equation (15) gives these loadings, and our present 
task is to put it into arithmetical form. The principal part of the 
calculation is to find $R^{-1}$, the reciprocal of the matrix of correlations
(21). This is best accomplished by computation methods for cal- 
culating determinants which have been devised by Dr. A. C. Aitken 
of Edinburgh University, but which he has not yet published in full. 
I use them here with explicit acknowledgments to him. I think that 
Dr. Aitken’s methods of determinantal computation will prove of 













1 In order to give this illustration an air of reality I have taken the data (slightly 
simplified) from Dr. W. P. Alexander’s experimental study “Intelligence, Concrete
and Abstract,” British Journal of Psychology Monograph Supplement, Vol. XIX, 
1935. The four tests are the Stanford Binet, Thorndike Reading Test, Spearman’s 
Analogies test in forms, and a Picture Completion Test, numbered one, three, six, 
and eleven in Alexander’s Tables XVI and XIX. They were given to one hundred 
delinquent women. The factors are g general intelligence, v the verbal factor, 
and F the practical factor. I wish to make it quite clear however that in thus 
using Dr. Alexander’s data and analysis I am not in any way whatever criticizing 
or commenting upon his work. The example is a mathematical illustration only. 
I have kept the number of tests down to four for clarity and for economy in 
printing. 
very great assistance in experimental psychology, as in many other
branches of science. In slab A of Table I, the matrix whose reciprocal
is to be calculated is in the top left-hand corner of the computation.
It is surrounded by a pattern of zeros, plus ones, and minus ones as
shown. In the right-hand column appears the sum of each row.
The top left-hand number¹ is marked as the “pivot” to be used in
obtaining slab B of the table from slab A.

Slab B is obtained by calculating, and setting down in order with 
their proper signs, all the tetrad-differences of slab A which have 


TABLE I. COMPUTATION OF THE RECIPROCAL OF A MATRIX OF CORRELATIONS BY
AITKEN’S METHOD

                                                                      Check
 (A)      1     .69    .49    .39 |    -1      .      .      . |   1.57
         .69     1     .38    .19 |     .     -1      .      . |   1.26
         .49    .38     1     .27 |     .      .     -1      . |   1.14
         .39    .19    .27     1  |     .      .      .     -1 |    .85
          1      .      .      .  |     .      .      .      . |   1.00
          .      1      .      .  |     .      .      .      . |   1.00
          .      .      1      .  |     .      .      .      . |   1.00
          .      .      .      1  |     .      .      .      . |   1.00

 (B)           .524   .042  -.079 |   .690    -1      .      . |   .177
               .042   .760   .079 |   .490     .     -1      . |   .371
              -.079   .079   .848 |   .390     .      .     -1 |   .238
              -.690  -.490  -.390 |    1       .      .      . |  -.570
                 1      .      .  |     .      .      .      . |  1.000
                 .      1      .  |     .      .      .      . |  1.000
                 .      .      1  |     .      .      .      . |  1.000
         ÷ 1

 (C)                  .396   .045 |   .228   .042  -.524     . |   .187
                      .045   .438 |   .259  -.079     .   -.524 |   .139
                     -.228  -.259 |    1    -.690     .      . |  -.177
                     -.042   .079 |  -.690     1      .      . |   .347
                      .524     .  |     .      .      .      . |   .524
                        .    .524 |     .      .      .      . |   .524
         ÷ .524

 (D)                         .327 |   .176  -.063   .045  -.396 |   .089
                            -.176 |   .855  -.503  -.228     . |  -.052
                             .063 |  -.503   .759  -.042     . |   .277
                            -.045 |  -.228  -.042   .524     . |   .209
                             .396 |     .      .      .      . |   .396
         ÷ .396

 (E)                              |   .784  -.443  -.168  -.176 |  -.003
                                  |  -.443   .637  -.042   .063 |   .215
                                  |  -.168  -.042   .438  -.045 |   .183
                                  |  -.176   .063  -.045   .396 |   .238
         ÷ .327

















1 Other numbers may be chosen as pivots, but only from the matrix itself or,
in later slabs, from the numbers derived from it. The signs of the tetrad-differences
must be obtained by beginning with the pivot, and the sign of the whole determi-
nant may be reversed.
the pivot of that slab as one corner of the tetrad. These tetrad-
differences are worked right across all columns, including the right-
hand or check column. For example, .524 is the tetrad-difference
1 × 1 − .69 × .69. In the check column, for another example,
.371 in slab B is the tetrad-difference 1 × 1.14 − .49 × 1.57 in slab A.
And .371 checks as the sum of its row. Slab C is made from slab B
in the same way, .524 being taken as pivot. In slab C however and
in each later slab, each tetrad-difference, before being set down, must
be divided by the previous pivot; but as in this case the previous pivot
was 1, this makes no change in slab C.

In slab D however this rule does make a difference. The pivot
in slab C is .396, and every tetrad-difference containing it, before
being set down in slab D, must be divided by the previous pivot .524.
Thus .176 in the first row of slab D is


$$\frac{.396 \times .259 - .045 \times .228}{.524}$$


The best way of doing this is to divide the top row of slab C by .524,
write the results on a loose slip of paper, and cover the top row of
slab C by it while slab D is being calculated, if Crelle’s Tables are
being used. Some calculating machines, e.g. an Archimedes or a
Brunsviga, can calculate the tetrad-difference and divide it by the
previous pivot without any of the intermediate steps needing to be
copied down.¹

The process goes on until there are no numbers remaining in the
left-hand block. The number of slabs needed to do this will depend
on the size of the original matrix. Here it happens at slab E, and all
that is now required is to divide the numbers in the middle block of
slab E by the previous pivot .327. The result is the reciprocal matrix
to that with which we started. The two matrices are called reciprocal
because when multiplied together (like determinants) they give the
unit matrix, which has “ones” in its main diagonal and zeros every-
where else. The last pivot (.327) is the value of the determinant
of the original matrix.
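For readers who would rather let a machine do the slabs, the procedure just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `aitken_reciprocal` is hypothetical, the pivot is always taken as the top left-hand number of each slab (the text allows other choices), the check column is omitted, and full floating-point precision is carried, so the figures agree with Table I only to about the third decimal.

```python
# Sketch of the pivotal condensation described above (check column omitted).
# Assumes the top left-hand number of every slab is a usable (non-zero) pivot.

def aitken_reciprocal(R):
    n = len(R)
    # Slab A: the matrix with a diagonal of minus ones on its right,
    # and a diagonal of plus ones (followed by zeros) beneath it.
    slab = [[float(v) for v in R[i]] + [-1.0 if i == j else 0.0 for j in range(n)]
            for i in range(n)]
    slab += [[1.0 if i == j else 0.0 for j in range(n)] + [0.0] * n
             for i in range(n)]
    previous_pivot = 1.0
    for _ in range(n):
        top = slab[0]
        pivot = top[0]
        if abs(pivot) < 1e-12:
            raise ZeroDivisionError("zero pivot; another pivot would have to be chosen")
        # Every tetrad-difference having the pivot as one corner,
        # divided by the previous pivot before being set down.
        slab = [[(pivot * row[j] - row[0] * top[j]) / previous_pivot
                 for j in range(1, len(row))]
                for row in slab[1:]]
        previous_pivot = pivot
    determinant = previous_pivot          # the last pivot
    reciprocal = [[v / determinant for v in row] for row in slab]
    return reciprocal, determinant

# The correlations of (21).
R = [[1.00, 0.69, 0.49, 0.39],
     [0.69, 1.00, 0.38, 0.19],
     [0.49, 0.38, 1.00, 0.27],
     [0.39, 0.19, 0.27, 1.00]]

reciprocal, determinant = aitken_reciprocal(R)
```

Multiplying `R` by `reciprocal` should give the unit matrix, which is a convenient overall check in place of the row-sum column.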

From the reciprocal matrix the team loadings of the tests to give 
the best measure of the factors can be at once obtained. For example 
the best team of these four tests to measure the factor g is 


from slab C. 








1 Some workers may prefer to postpone dividing by the pivots to the end of the 
table, and then divide by the product of all the pivots, which will be different 
pivots from those here printed. The end result is the same. 
$$\text{Estimated } g = .300z_1 + .095z_2 + .532z_3 + .095z_4$$


where $z_1$, $z_2$, $z_3$, $z_4$ are a man’s (standardized) scores in the four tests.
These loadings are obtained as follows. Multiply the rows of the
reciprocal matrix by the loadings of g in the four tests (.66, .52, .74,
and .37), and sum the columns to obtain the above loadings of the
tests in g. Thus





TABLE II

   .66 ×     .784   -.443   -.168   -.176
   .52 ×    -.443    .637   -.042    .063
   .74 ×    -.168   -.042    .438   -.045
   .37 ×    -.176    .063   -.045    .396

             .517   -.292   -.111   -.116
            -.230    .331   -.022    .033
            -.124   -.031    .324   -.033
            -.065    .023   -.017    .147

   Sums      .098    .031    .174    .031     ÷ .327
   Loadings  .300    .095    .532    .095








Estimated g (or v, or F) is not in standard measure, the missing variance 
being the indeterminate part, which grows less as more tests are used. 
From the same reciprocal matrix, by using as the row multipliers

     .52,        and        .21,
     .66,                   nil,
     nil,                   nil,
     nil,                   .71,
     for v,                 for F,        (see equation 20)

we obtain the teams of tests for these other factors. Putting the
results together we have

$$
\begin{aligned}
\text{Estimated } g &= .300z_1 + .095z_2 + .532z_3 + .095z_4 \\
\text{Estimated } v &= .353z_1 + .581z_2 - .352z_3 - .153z_4 \\
\text{Estimated } F &= .121z_1 - .148z_2 - .206z_3 + .747z_4
\end{aligned}
\qquad (22)
$$


The correlations of these teams with “true” g, v, and F respectively
can be obtained by replacing $z_1$, $z_2$, $z_3$ and $z_4$ by the appropriate loadings
from (20) and taking the square root. Thus the square of the correla-
tion of the v-team with “true” v is


$$.353 \times .52 + .581 \times .66 - .352 \times 0 - .153 \times 0 = .567,$$

and the correlation itself is .753 (the highest correlation of a single
test being .66). Similarly the values .822 and .746 for the correlations
of the g team and the F team with the “true” values of those factors
can be found.
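The arithmetic of Table II and of the team correlations can be retraced as follows. This is a sketch only: it uses the loadings of g and v as stated in the text (.66, .52, .74, .37 and .52, .66, nil, nil) together with the correlations (21), inverts the matrix by ordinary elimination instead of by slabs, and carries full precision, so its results agree with the printed figures only to within a few units in the third decimal. The names `inverse`, `team_weights` and `team_correlation` are hypothetical.

```python
# Sketch: team loadings M' R^{-1} for g and v, and their correlations
# with the "true" factors. Inversion is by plain Gauss-Jordan elimination.

def inverse(a):
    n = len(a)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

R = [[1.00, 0.69, 0.49, 0.39],
     [0.69, 1.00, 0.38, 0.19],
     [0.49, 0.38, 1.00, 0.27],
     [0.39, 0.19, 0.27, 1.00]]
R_inv = inverse(R)

# Loadings of each factor in the four tests, as quoted in the text.
loadings = {"g": [0.66, 0.52, 0.74, 0.37],
            "v": [0.52, 0.66, 0.00, 0.00]}

def team_weights(factor_loadings):
    # Multiply the rows of the reciprocal matrix by the loadings, sum the columns.
    return [sum(l * R_inv[i][j] for i, l in enumerate(factor_loadings))
            for j in range(len(R_inv))]

def team_correlation(factor_loadings):
    # Replace each z by the factor's loading in that test, then take the square root.
    w = team_weights(factor_loadings)
    return sum(wj * lj for wj, lj in zip(w, factor_loadings)) ** 0.5

teams = {name: team_weights(l) for name, l in loadings.items()}
correlations = {name: team_correlation(l) for name, l in loadings.items()}
```

The same two functions serve for F once its loadings are supplied in the same way.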

If the reciprocal matrix is not otherwise required, the complete 
calculation as in Table I is unnecessary. A shorter method will be 
indicated in a note to Table III. Further short cuts are possible if 
only relative weights are wanted and the team-correlation is not 
required, but they are hardly worth while. 


V. VOCATIONAL GUIDANCE WITH THE AID OF FACTORS¹


In order to guide a candidate, whose factors we have measured,
into a suitable occupation $z_0$, we must know the way in which these
factors enter into this occupational ability, that is, we must know
that the analysis of the occupation is


$$z_0 = l_0g + m_0v + n_0F + k_0s_0 \qquad (23)$$

for the case of three factors, or generally

$$z_0 = p_0'f \qquad (23a)$$

where $p_0'$ is the row-vector of the factor loadings in the occupation, i.e.

$$p_0' = [\,l_0 \;\; m_0 \;\; n_0 \;\cdots\,] \qquad (24)$$


We could then insert the values of a candidate’s factors into equation
(23) or (23a) and find an estimate of his $z_0$, his ability in that occupa-
tion, an estimate which would be the better, the smaller $k_0$, the loading
of the occupational specific, for we do not know the value of $s_0$ for
the candidate.

Unless we are content with a subjective guess at them, we must 
have obtained the factor-loadings of the occupation ($l_0$, $m_0$, . . . etc.)
by giving to a number of persons already engaged in that occupation 
the same tests which we have used in measuring the new candidate’s 
factors, or at least similar tests. It is clear therefore that both the 
factor-loadings of the occupation, and the candidate’s factors, are 
functions of the scores in actual tests and of the correlations between 
tests. The factors therefore can be eliminated in equation (23) 
or (23a), and the estimate of the candidate’s occupational ability 
expressed in actual test-scores and correlations. Indeed, if in equation 
(23a) we substitute the value of f from equation (15), we have 





1 Section V may be omitted by non-mathematical readers. 
















Technique in Factorial Analysis 49 


$$\text{Estimated } z_0 = p_0'M'R^{-1}z \qquad (25)$$


where f, the column of factors, no longer appears. Moreover, a little
consideration shows that equation (25) is none other than the ordinary
regression equation, which we might have written down at once
without measuring the candidate’s factors at all. For we have


$$p_0'M' = [\,r_{01} \;\; r_{02} \;\; r_{03} \;\cdots\,] = r_0' \text{ say} \qquad (26)$$

and then

$$
p_0'M'R^{-1} = [\,r_{01} \;\; r_{02} \;\cdots\,]
\begin{bmatrix} R_{11} & R_{12} & \cdots \\ \vdots & \vdots & \end{bmatrix}
\div |R|
$$

so that clearly equation (25) may be written


$$\text{Estimated } z_0 = r_0'R^{-1}z \qquad (27)$$


But this is the ordinary regression equation, giving estimated $z_0$.
If $\Delta$ is the determinant of all the correlations of tests and the occupa-
tion with one another, (27) can be written

$$\text{Estimated } z_0 = -\frac{\Delta_{01}}{\Delta_{00}}z_1 - \frac{\Delta_{02}}{\Delta_{00}}z_2 - \frac{\Delta_{03}}{\Delta_{00}}z_3 - \cdots \qquad (28)$$

and equation (28) is identical with that derived from equation (23).
An arithmetical illustration of this is given in the next section. 


VI. A MODEL CALCULATION ILLUSTRATING VOCATIONAL GUIDANCE 
WITH THE AID OF FACTORS 


Let us imagine an occupation $z_0$ whose factorial analysis is


$$z_0 = .550g + .450v + .600F + .368s_0 \qquad (29)$$


and inquire how the candidate, whose factors we measured in section IV,
will suit this occupation. His g, v, and F are given by equations (22),
which will give numerical values for these factors in that candidate
if his test-scores $z_1$, $z_2$, $z_3$, $z_4$ are inserted. These values of his factors,
inserted into equation (29), will, if we ignore his $s_0$ which we do not
know, give a value for his $z_0$, his predicted performance in this occupa-
tion. If we substitute from equations (22) in equation (29), and
suppress $s_0$, we obtain





50 The Journal of Educational Psychology 


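$$
\begin{aligned}
\text{Estimated } z_0
&= (.550 \times .300 + .450 \times .353 + .600 \times .121)\,z_1 \\
&+ (.550 \times .095 + .450 \times .581 - .600 \times .148)\,z_2 \\
&+ (.550 \times .532 - .450 \times .352 - .600 \times .206)\,z_3 \\
&+ (.550 \times .095 - .450 \times .153 + .600 \times .747)\,z_4 \\
&= .396z_1 + .225z_2 + .011z_3 + .432z_4
\end{aligned}
\qquad (30)
$$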


At the end therefore of our factorial analysis of the tests and the 
occupation, and our measurement of a candidate’s factors by loaded 
teams of the tests, we arrive at equation (30), which we could have 
written down direct, without any calculation of the candidate’s factors 
at all, for it is nothing but the regression equation for predicting the 
occupation direct from the tests. 
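The agreement of the two routes can be verified by a short calculation. The sketch below takes the team coefficients and occupation loadings exactly as they appear in the expansion of equation (30), and sets beside them the regression weights obtained by solving the normal equations for the correlations (21) and the occupation correlations .72, .58, .41, .63 quoted in this section. The helper `solve` is a hypothetical name for ordinary Gaussian elimination, used here in place of the determinantal computation.

```python
# Sketch: factor route (combine the teams of equation 22 with the
# occupation loadings of equation 29) against the direct regression route.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            factor = aug[r][c] / aug[c][c]
            aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

# Team coefficients of equations (22) and occupation loadings of (29).
teams = {"g": [0.300, 0.095, 0.532, 0.095],
         "v": [0.353, 0.581, -0.352, -0.153],
         "F": [0.121, -0.148, -0.206, 0.747]}
occupation = {"g": 0.550, "v": 0.450, "F": 0.600}

via_factors = [sum(occupation[k] * teams[k][j] for k in teams) for j in range(4)]

# Direct regression of the occupation on the four tests.
R = [[1.00, 0.69, 0.49, 0.39],
     [0.69, 1.00, 0.38, 0.19],
     [0.49, 0.38, 1.00, 0.27],
     [0.39, 0.19, 0.27, 1.00]]
r0 = [0.72, 0.58, 0.41, 0.63]
via_regression = solve(R, r0)
```

The two sets of weights differ only in the third decimal place, which is to be expected when the teams and the occupation correlations have each been rounded before use.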

For from the loadings in equations (29) and (1) we know the cor-
relations between the occupation and the four tests are .72, .58, .41,
and .63 respectively; for example, the correlation between the occupa-
tion and $z_1$ is

$$.550 \times .66 + .450 \times .52 + .600 \times .21 = .72$$

Indeed, we may well have known these correlations by actual observa- 
tion before we made the factorial analysis of the occupation. That 
case (in which the same tests are used for analysing the man and for 
analysing the occupation) we shall return to presently. Meanwhile, 
even if different tests were used for analysing the occupation, we can 
reconstruct these correlations as above, and we then have the whole 
matrix of the correlations which arise from the factors, namely: 


         |  z0     z1     z2     z3     z4
      z0 |   1    .72    .58    .41    .63
      z1 |  .72    1     .69    .49    .39
      z2 |  .58   .69     1     .38    .19        (31)
      z3 |  .41   .49    .38     1     .27
      z4 |  .63   .39    .19    .27     1


To this the ordinary regression calculation can be applied. It also 
is best carried out by Aitken’s methods, and this example will serve 
as a model for introducing these to psychologists. The whole is set
out in Table III. In the top left-hand corner, as before, appears the
square matrix of the intercorrelations of the tests, excluding the
occupation or “criterion.” On its right appears the same diagonal
pattern of “minus ones.” Below it however instead of the pattern
of “plus ones” of the former calculation, appears only the row of
correlations of the tests with the occupation or criterion. The check
column appears as before on the right. Thereafter the calculation
proceeds by “pivotal condensation” exactly as before, and the regres-
sion coefficients .391, .223, .017, .431 come out at the bottom. They
are identical with the coefficients of equation (30) obtained by measur- 
ing the candidate’s factors, the actual differences in the third place of 
decimals being arithmetical only. Both methods give the same team- 
correlation with the occupation, namely, .83, obtained by inserting 
the occupation correlations instead of the z’s in equation (30) and 
taking the square root. 


TABLE III. COMPUTATION OF A REGRESSION EQUATION BY AITKEN’S METHOD

                                                                    Check
        1     .69    .49    .39 |    -1      .      .      . |   1.57
       .69     1     .38    .19 |     .     -1      .      . |   1.26
       .49    .38     1     .27 |     .      .     -1      . |   1.14
       .39    .19    .27     1  |     .      .      .     -1 |    .85
       .72    .58    .41    .63 |     .      .      .      . |   2.34

             .524   .042  -.079 |   .690    -1      .      . |   .177
             .042   .760   .079 |   .490     .     -1      . |   .371
            -.079   .079   .848 |   .390     .      .     -1 |   .238
             .083   .057   .349 |   .720     .      .      . |  1.210
       ÷ 1

                    .396   .045 |   .228   .042  -.524     . |   .187
                    .045   .438 |   .259  -.079     .   -.524 |   .139
                    .026   .189 |   .320   .083     .      . |   .619
       ÷ .524

                           .327 |   .176  -.063   .045  -.396 |   .089
                           .141 |   .231   .061   .026     . |   .459
       ÷ .396

                                |   .128   .073   .005   .141 |   .347
       ÷ .327

                                    .391   .223   .017   .431 |   1.06
                                    Regression coefficients.











NOTE. This form of calculation can also be used to obtain, one at a time, the
best teams for g, v, and F. For this purpose the loadings of one of these factors in
the four tests must be substituted for the correlations of the occupation (.72, .58,
.41, .63) in the bottom row of the first slab. And other rows, the loadings of the
other factors, can be added beneath, and the three calculations carried on
simultaneously, three rows of regression coefficients coming out at the bottom.
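The layout of Table III, including the extension mentioned in the note, can be sketched as below. The sketch makes the same assumptions as any mechanical version of the table: the top left-hand number of each slab serves as pivot, the check column is left out, and full precision is carried, so that the coefficients agree with the printed ones only to about the third decimal. The function name `aitken_regression` is hypothetical.

```python
# Sketch of Table III: the correlation matrix bordered on the right by minus
# ones and beneath by one or more rows (criterion correlations or factor
# loadings), reduced by pivotal condensation.

def aitken_regression(R, bottom_rows):
    n = len(R)
    slab = [[float(v) for v in R[i]] + [-1.0 if i == j else 0.0 for j in range(n)]
            for i in range(n)]
    slab += [[float(v) for v in row] + [0.0] * n for row in bottom_rows]
    previous_pivot = 1.0
    for _ in range(n):
        top = slab[0]
        pivot = top[0]
        if abs(pivot) < 1e-12:
            raise ZeroDivisionError("zero pivot; another pivot would have to be chosen")
        slab = [[(pivot * row[j] - row[0] * top[j]) / previous_pivot
                 for j in range(1, len(row))]
                for row in slab[1:]]
        previous_pivot = pivot
    # One row of coefficients per bottom row, after dividing by the last pivot.
    return [[v / previous_pivot for v in row] for row in slab]

R = [[1.00, 0.69, 0.49, 0.39],
     [0.69, 1.00, 0.38, 0.19],
     [0.49, 0.38, 1.00, 0.27],
     [0.39, 0.19, 0.27, 1.00]]

occupation_row = [0.72, 0.58, 0.41, 0.63]   # correlations of the tests with the occupation
g_row = [0.66, 0.52, 0.74, 0.37]            # loadings of g, added beneath as in the note

coefficients = aitken_regression(R, [occupation_row, g_row])
```

Carrying the two rows together costs only one extra line in each slab, which is the economy the note has in view.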


It is clear therefore that there is no need to measure the candidate’s 
factors in order to predict his most probable success in the occupation. 
The only necessity for factors in the above calculation is in order to 
deduce the correlation of the occupation with the tests, on the assump-
tion that the occupation had been analysed in some other way into fac-
tors, and not by means of the same tests as are given to the candidate.

In the case where the same tests have been given to the candidate 
as have been used for analysing the occupation (clearly the most 
desirable plan) all necessity for the use of factors disappears, for in 
that case the actual observed values of all the correlations in the matrix 
(31) are known, and the regression calculation can be carried out 
without any analysis either of tests or occupation into factors, and the 
same result arrived at. True, the observed correlations are not quite 
the same as those in (31), but they are at least as good and presumably 
better, while if the factorial analysis was carried on until (31) was 
within sampling error distance of the observations the differences will 
be slight. The observed correlations between the tests were given in 
(19). If we add observed correlations of the occupation with the 
tests we have the observed matrix 


         |  z0     z1     z2     z3     z4
      z0 |   1    .75    .56    .39    .60
      z1 |  .75    1     .73    .47    .45
      z2 |  .56   .73     1     .39    .39        (32)
      z3 |  .39   .47    .39     1     .27
      z4 |  .60   .45    .39    .27     1





If we subject these observed correlations to a calculation like that of 
Table III we obtain 


$$\text{Estimated } z_0 = .604z_1 - .021z_2 + .025z_3 + .330z_4 \qquad (33)$$


and a team-correlation with the occupation of .81. This team differs 
a good deal from that of equations (30) or Table III, and gives a 
slightly lower correlation. But it is based on the actual observations 
and not on an interpolated theory. Moreover, its correlation is 
literally the highest that can be obtained with these tests. The slightly 
higher correlation .83 of the team of equation (30) is the correlation 
that would be obtained if the matrix of correlations were as in (31). 
But in truth the correlations were as in (32), and when we apply the 
weights of team (30) to the observed correlations of matrix (32) we
obtain, not .83 but .785.
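The comparison of the two teams on the observed correlations can be retraced with the usual formula for the correlation of a weighted sum with a criterion, namely the weighted sum of the criterion correlations divided by the standard deviation of the team. The sketch below assumes that formula and uses only the figures of (32), (30) and (33); the function name `team_correlation` is hypothetical.

```python
# Sketch: correlation of a weighted team of tests with the occupation,
# evaluated on the observed correlations of matrix (32).

R_observed = [[1.00, 0.73, 0.47, 0.45],
              [0.73, 1.00, 0.39, 0.39],
              [0.47, 0.39, 1.00, 0.27],
              [0.45, 0.39, 0.27, 1.00]]
r0_observed = [0.75, 0.56, 0.39, 0.60]

def team_correlation(weights, R, r0):
    n = len(weights)
    covariance = sum(w * r for w, r in zip(weights, r0))
    variance = sum(weights[i] * R[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    return covariance / variance ** 0.5

team_30 = [0.396, 0.225, 0.011, 0.432]    # weights reached through the factors
team_33 = [0.604, -0.021, 0.025, 0.330]   # weights from the observed correlations

corr_30 = team_correlation(team_30, R_observed, r0_observed)
corr_33 = team_correlation(team_33, R_observed, r0_observed)
```

On these figures the regression team of (33) comes out near .81 and the factor team of (30) a little under .79, in agreement with the values quoted above.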





VII. THE ADVANTAGES AND DISADVANTAGES OF USING FACTORS IN
VOCATIONAL GUIDANCE 


The preceding sections make it quite clear that, when vocational 
guidance proceeds by giving to a candidate a number of tests which 
have previously been given to persons already engaged in the occupa- 
tion, the use of factors has no mathematical justification whatever. 

If the tests and the criterion are conceived of as radiating straight 
lines, fixed at such angles from one another that the cosines of those 
angles are the correlations, then the candidate’s scores in the n tests 
fix his representative point in the n-space of the tests (namely the 
point whose projections on to the test-lines give his scores). This 
enables his criterion or occupation value to be calculated, with 
exactitude if the criterion line is also in this n-space, only in projection 
on the n-space if the criterion line is outside it—but no more is in this 
case possible by whatever means. The method of the regression 
equation, Table III, takes us to this goal without the use of any 
rectangular axes in our space. Indeed it may be looked upon as a 
method which uses the test-lines themselves as a kind of oblique 
coordinate axes. 

The factor method replaces the oblique coordinates of the test- 
lines by rectangular coordinates, and then finds from these the pro- 
jection of the candidate’s criterion value on the test space. If the 
rectangular coordinates completely and exactly reproduce the point 
given by the oblique coordinates, the end result will be the same. 
That is the best that can be hoped for, that the roundabout method 
may be as good as the direct. If an incomplete set of rectangular 
axes is used, the person’s representative point will not be defined but 
only a space in which it somewhere lies, so that the indeterminacy of 
the prediction will be heightened. 

To change to a simpler analogy, the regression equation method is 
like an American buying American goods from another American in 
dollars. The factor method is as though the buyer first unnecessarily 
changed his dollars into francs, and then asked the seller to quote him 
a price in francs. He is sure to lose on the roundabout method, proba-
bly lose money and certainly lose time.

The factor method will however be useful when a medium of 
exchange is essential, as when in our analogy the buyer has one coinage 
and the seller, a foreigner, has another. That is, when one set of 
tests has been used for analysing the occupation and a different set 







has been given to the candidate. This is not an ideal situation, but 
it is one which will frequently arise. It is especially likely to arise, 
for example, when we wish to practice not merely vocational selection 
but vocational guidance. In selecting men for a given occupation we 
are almost certain to use the same tests on the candidates as on the 
occupation. But if we wish to advise the rejected men about other
occupations, we are less likely to be in this mathematically most 
favourable position, and we may have to use factors as go-betweens. 


VIII. SUMMARY 


An orthogonal matrix is arrived at (equations 4 and 7) which 
rotates the axes of a factorial analysis (into generals and specifics) to 
new positions which nevertheless fulfil all the conditions, thus illustrat- 
ing the indeterminacy of all analyses which leave a different specific 
in each test. A formula is obtained (equation 15, and see also 15’ and 
17) which gives the best loadings of a team of tests to measure each 
factor. In Table I and Table II this formula is illustrated by an 
arithmetical calculation, using Aitken’s methods of determinantal 
computation. In the remainder of the article it is proved and illus- 
trated that the direct method of the regression equation gives a predic- 
tion of a candidate’s success in an occupation, which is either exactly 
the same as or better than that given by a factor method; but it is 
shown that factors may in certain cases form a useful “medium of
exchange.” In no case is it necessary to measure the candidate’s
factors. Table III, which illustrates the use of Aitken’s methods in
calculating regression coefficients, will probably be useful to computers 
in psychological laboratories. 











EMOTIONAL CORRELATES OF ERRORS IN 
LEARNING* 


HAROLD D. CARTER 
Institute of Child Welfare, University of California 


It is commonly believed that pleasant things are more easily 
learned and remembered than unpleasant things. If this is true, 
study of the errors made in the attempt to learn affectively-toned 
material should furnish convincing evidence of the trend. The com- 
moner procedure has been to study not errors, but positive learning, 
emphasis being put upon amount of material learned, or relative 
excellence of performance, in relation to pleasantness and unpleasant- 
ness. While the results of the two procedures must naturally agree 
with respect to general trends, each has also its own distinctive 
contribution. In this study, over eight thousand errors made by 
children in the learning of pleasant, indifferent, and unpleasant words 
were tabulated and analyzed. Supplementary data are also pre- 
sented, based on correct responses. 

Since results may vary for learning of different kinds of material, 
it will be necessary to establish basic facts within a limited portion 
of the field. The use of words as the material to be learned seems to 
offer several outstanding advantages. First, a wide variety of words 
can be found which, in their connotation, are distinctly pleasant, 
indifferent, or unpleasant, respectively. Second, because the ideas, 
things, and attitudes of everyday life are almost universally referred 
to in verbal terms, it is possible to deal with a wide variety of emotional 
material indirectly through the medium of words. Third, because 
of this catholicity of reference, results obtained from the use of verbal 
stimuli may have rather general significance. 


THE LITERATURE 


Reviews** of the literature dealing with the general problem 
reveal great inconsistency in the results obtained by different investi- 
gators. Part of this may be attributed to lack of comparability of 
the studies conducted. However, even when the field is restricted to 
studies of learning of verbal material, there is far from complete 
harmony. Tait'! found that pleasant words were learned most 





* This report deals with certain limited aspects of a cumulative investigation 
which the writer is conducting at the Institute of Child Welfare. For criticism of 
the manuscript the writer is indebted to Dr. H. E. Jones and Dr. H. S. Conrad.
efficiently, unpleasant ones were intermediate, and indifferent ones
were learned worst of all. In Tolman’s!* experiment, the indifferent 
words were learned more rapidly than the unpleasant ones. Accord- 
ing to Whately Smith,* affectively-toned words may be better learned, 
or worse learned, than neutral words, depending on the quality of 
the affective tone. Harold Jones,® like Smith, concluded that emo- 
tional intensity may affect learning either adversely or favorably, 
depending on whether the emotion is positive or negative. The data 
of Chaney and Lauer did not reveal significant differences in learning 
the three types of material. Thomson!* showed that pleasant words 
were learned and remembered better than unpleasant ones. Using 
a recognition method, Lynch’ found pleasant words remembered best, 
indifferent ones second, and unpleasant words poorest remembered, 
but this finding was limited to results obtained when sufficient inter- 
vals elapsed between learning and recognition. Cason‘ found that 
pleasant words were best learned, indifferent ones were second, and 
unpleasant ones third, but in retention the order was changed to I,
P, U. Stagner’ found an order for learning which was P, I, U.
Balken! demonstrated that differences in learning pleasant, indifferent, 
and unpleasant words in her material were not significant. White and 
Ratliff! found that pleasant words were not better learned than 
unpleasant ones, but were better recalled when the criterion was 
changed to require correct position in series. 

The reasons for the diversity of results are not completely under- 
stood, since the studies differ in selection of words, in selection of 
subjects, in tasks performed, and in other aspects of procedure. There 
is apparently some tendency to find more efficient learning and reten- 
tion of pleasant material than of indifferent or unpleasant material, 
but this finding is far from universal. The greatest disagreement 
concerns the relative position of indifferent and unpleasant material. 
Since the published results are somewhat inconsistent, it seems desir- 
able to present further material contributing to this field of research. 
As previously noted, the present report places emphasis upon detailed 
consideration of the errors made in learning. 


THE DATA 


Data were secured from a group of about one hundred children, 
who were tested on three separate occasions, at intervals of about 
six months. At the time of first testing, the children were in the sixth 
and seventh grades of the public schools of Oakland, California. An 
earlier publication® dealing with results of the first series of data
included a detailed account of the procedure; we shall therefore present 
here an abbreviated description, with occasional amplification of 
relevant details. 

As test material, single words were used, rather than more complex 
material, for greater ease in experimentation, and because it seemed 
desirable to have as little interference as possible with the children’s 
normal associative responses. The words which seemed most likely 
to be suitable representatives of the pleasant, indifferent, and unpleas- 
ant categories were selected from a larger list, for use in the experiment. 
All difficult words were eliminated, and some attempt was made to 
include in the three categories words comparable with respect to such 
features as concreteness, length, and familiarity. At the same time, 
the words were so selected as to be greatly different in emotional 
connotations. 

Ratings of the pleasantness and unpleasantness of the words were 
secured from the children who served as subjects in the learning 
experiment. In order to provide a natural task for making the judg- 
ments, the words were typewritten upon cards, which the subjects 
sorted out into five classes, class one including only very pleasant 
words, class five only very unpleasant words, etc. The mean ratings 
obtained for words in each category are shown in the tables. If this 
criterion of emotional tone be accepted, the words are distinctly 
grouped in the three desired classes. 

At each of the three periods of data collection, the subjects were 
given a free association test, in which records were made of association 
times and word responses to the stimulus words. In administration 
of these tests, the pleasant, indifferent, and unpleasant words were 
distributed evenly throughout the lists; the effects of position in series 
were eliminated by systematic variation of the order of the words for 
different subjects. The experimental list of words was always pre- 
ceded and followed by three or four buffer words not included in the 
analysis of results. The mean association times for the words in the 
three categories are shown in the tables. If we accept the usual 
explanation of long association times, these data furnish additional 
indications of the emotional stimulus value of the words. 

The children were tested for the learning of the words by a paired 
associates technique. The task was to recall the appropriate word 
when a picture was presented. Each individual was given five trials 
(prompting method), and the number of correct recalls was taken as 
the measure of learning. Failures to recall the correct word were 
noted, and false responses were identified on the record sheets. The 
pictures and words were shown together for five seconds on the first 
exposure; in the succeeding five learning trials, the subjects were 
allowed five seconds in which to recall the words in response to the 
pictures; in case of failure, the words were again shown for five seconds, 
but when a correct response occurred, the experimenter passed on 
to the next picture at once. From the results secured in this way, a 
record of correct responses, false responses, and failures to respond 
was obtained for each person, for each word separately. By systemati- 
cally altering the order of presentation, possible effects of position in 
series were eliminated from the group results. 
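The text does not say how the order of presentation was varied. One common scheme, shown here only as a hedged sketch and not as the authors' procedure, is cyclic rotation, in which every word occupies every serial position equally often across subjects. The function name `rotated_orders` and the shortened word list are illustrative assumptions.

```python
def rotated_orders(words):
    """Return one presentation order per starting position (cyclic rotation)."""
    return [words[i:] + words[:i] for i in range(len(words))]


# Two words per category keep the illustration short; the study used eight.
words = ["mother", "pencil", "pimple", "candy", "glass", "fight"]
orders = rotated_orders(words)

# Across the set of orders, each word appears once in every serial position,
# so position effects cancel in the group totals.
for position in range(len(words)):
    assert sorted(order[position] for order in orders) == sorted(words)
```

Successive subjects would then be assigned successive orders from the list.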

The first series of data were collected in the spring of 1933. The 
words used in Series I were as follows: “Pleasant” words, mother, 
father, candy, movies, love, marry, dance, and kiss; “indifferent” words, 
pencil, enter, notice, fill, glass, turn, middle, and lid; “unpleasant” 
words, pimple, fight, bleed, fear, insult, garbage, cheat, and vomit. 

The second series of data were collected in the fall of 1933. The 
words used were as follows: “Pleasant” words, circus, pie, brother, 
play, sister, wedding, admire, and hug; “indifferent” words, read, 
walk, pen, sand, hat, trade, center, and point; “unpleasant” words, 
lousy, abuse, fright, blister, warts, liar, stink, and kill. 

The third series of data were collected in the spring of 1934. The 
words were as follows: “Pleasant” words, happy, cake, athlete, radio, 
beautiful, clever, popular, and captain; “indifferent” words, workman, 
carpet, rocks, number, closed, curious, uneven, and cloudy; “unpleasant” 
words, clumsy, ashamed, stupid, sissy, scabs, dirty, unemployed, and 
coward. 

The procedures described above were used in each of the three 
stages of data collection. The same group of children, including about 
fifty boys and fifty girls, served as subjects in all three parts of the 
study. 


RESULTS 


In the present study, all the errors made by the children in learning 
the words are considered in detail. An alternative procedure would 
be to study the children’s learning scores based upon number of cor- 
rect responses. The earlier report® based upon the first of the three 
series of data, employed this alternative procedure. Since the total 
number of responses per child is constant, some of the general facts 
about learning are revealed equally well through study of errors 
and through study of correct responses. Some specific facts under- 
lying the trends are shown, however, only by the tabulation of errors. 


TRENDS IN ERRORS 


In the learning situation, the children made two kinds of errors 
when they did not correctly recall the words associated with the 
pictures: first, they often failed to respond at all, and waited to be 
prompted; second, they responded by giving the wrong word. The 
two kinds of errors were tabulated separately, and are shown in 
columns five and six of Table I. 

Table I presents a summary of several kinds of material. The 
first column indicates the section of data to which each row of entries 
refers. In the second column are given the mean ratings of the words 


TABLE I. FREQUENCY OF OCCURRENCE OF ERRORS IN THE LEARNING OF PLEASANT, 
INDIFFERENT, AND UNPLEASANT WORDS 

                              Mean      Mean          Used in     Replaced       Failure      Sum of
                              P-U       association   place of    by incorrect   to respond   all errors
                              rating    time          correct     words
                                                      words
(1)                           (2)       (3)           (4)         (5)            (6)          (7)

“Pleasant” words.
  Series I, eight words       1.66      4.50          353         298            668          966
  Series II, eight words      1.86      3.49          167         144            516          660
  Series III, eight words     1.72      3.28          100         119            455          574
  Average or total*           1.75      3.76          620         561            1639         2200
“Indifferent” words.
  Series I, eight words       2.72      4.52          209         432            1044         1476
  Series II, eight words      2.50      3.46          113         187            631          818
  Series III, eight words     3.01      3.61          89          187            625          812
  Average or total            2.74      3.86          411         806            2300         3106
“Unpleasant” words.
  Series I, eight words       4.34      5.16          315         339            886          1225
  Series II, eight words      4.29      4.03          130         179            653          832
  Series III, eight words     4.19      3.62          124         147            520          667
  Average or total            4.27      4.27          569         665            2059         2724
Totals, all words                                     1600        2032           5998         8030

* The values in columns two and three are means, while those in columns four, five, six, and 
seven are sums. The entries in column seven are the sums of the corresponding entries in columns 
five and six. 
on a scale from one (very pleasant) to five (very unpleasant). The 
third column presents the mean association time in seconds, required 
by the children in giving free-association responses to the words. 
The fourth column indicates how frequently the words of each category 
were used in the false responses. The fifth column gives the fre- 
quencies with which words of each type were replaced by incorrect 
words. The sixth column presents the total numbers of failures to 
respond, for each category of words. The final column gives the sums 
of failures to respond and false responses, for each category. 

Study of Table I shows that the indifferent words were most 
frequently forgotten, or replaced by incorrect words. The unpleasant 
words are second, and the fewest errors are made in learning the 
pleasant words. This is true both for false responses (Column 5) 
and for failures to respond (Column 6). 

The most clear-cut differences are demonstrated in the first series 
of data, probably because the greatest number of errors was made 
in the first series of tests. The naive subjects thus provided a larger 
mass of data, which should logically yield more reliable evidence of 
trends. In the later series, the level of efficiency was in general 
improved. The tendency toward errorless performance makes it 
increasingly difficult to demonstrate efficiency-differences when work- 
ing with the same subjects after they become more mature and more 
practiced. 

The results of Series II and III are not exactly the same, but in 
essential respects are similar to those of Series I. The minor incon- 
sistencies indicate the dangers of over-generalizing from limited sam- 
plings of material. These inconsistencies suggest that occasionally, 
contrary to the general trend, some small samples of indifferent words 
may be better learned than some small samples of unpleasant words. 
One might emphasize the fact that for use in the present argument the 
results for single words in the different groups cannot be compared with 
great certainty, even when the figures used are mean learning scores 
based upon groups of learners. It is only when reasonably large 
groups of words are compared that interpretable differences are 
revealed. In the present argument, we are not interested in the 
specific attributes of single words.* However, when groups of words 
are selected and placed together because they are pleasant, there is 





* Certain aspects of the data are not investigated in this paper. Since the 
study is cumulative, some of the problems can be considered most advantageously 
in later publications, when the mass of experimental data is increased. 
probability that this general attribute, pleasantness, becomes relatively 
important in description of the group, in comparison with other 
characteristics which might happen to apply to specific words. 

Although details are different for the three series of data, some 
general consistencies may be noted, and the safest generalizations 
may be drawn from the total mass of material. In all, eight thousand 
thirty errors were tabulated, showing the frequency with which various 
words of three categories were not correctly recalled. These errors 
in learning numbered three thousand one hundred six for indifferent 
words, two thousand seven hundred twenty-four for unpleasant words, 
and two thousand two hundred for pleasant words. 
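The totals quoted above can be rebuilt from the series-level entries of Table I. The sketch below is only an arithmetic check under the assumption that the table values are transcribed correctly; the names `TABLE_I` and `category_totals` are illustrative and do not come from the report.

```python
# (false responses, failures to respond) for Series I, II, III, from Table I.
TABLE_I = {
    "pleasant":    [(298, 668), (144, 516), (119, 455)],
    "indifferent": [(432, 1044), (187, 631), (187, 625)],
    "unpleasant":  [(339, 886), (179, 653), (147, 520)],
}


def category_totals(table):
    """Sum both kinds of error over the three series for each category."""
    return {cat: sum(false + fail for false, fail in rows)
            for cat, rows in table.items()}


totals = category_totals(TABLE_I)
grand_total = sum(totals.values())
# Categories ordered from most errors to fewest.
ranking = sorted(totals, key=totals.get, reverse=True)
print(totals, grand_total, ranking)
```

The ranking puts the indifferent words first and the pleasant words last, which is the ordering stated in the text.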

Published studies are in great disagreement concerning the relative 
positions of indifferent and unpleasant words on the scale of efficiency 
in learning. The difference between these two types of words is 
evidently a relatively small one, which is adequately revealed only 
through study of a large body of data. The present results indicate 
that the unpleasant are better learned than the indifferent. But the 
detailed rating material shows that indifferent words, as ordinarily 
chosen, tend to be mildly pleasant in some instances, and different 
selections of them would yield variable experimental results. So-called 
theories of active forgetting, or repression of the unpleasant, are not 
clearly supported by the comparison of results for unpleasant and 
indifferent words. However, the partial correctness of such hypotheses 
is suggested by the tabulations of false responses, which show that 
incorrect pleasant words are often substituted for the correct unpleas- 
ant words when the latter are not remembered. 





NATURE OF THE FALSE RESPONSES 


The fourth column of figures in each table yields information 
concerning the nature of the mistakes made. When the children 
gave the wrong words, they tended to give one of the other words in 
the experimental series. Since such false responses could belong in 
any of the three categories, tabulation of the frequencies with which 
each pleasant, indifferent, and unpleasant word was so used should be 
instructive. On the average, each word would by chance be used 
incorrectly an equal number of times. The results for the three 
separate series are given in Table I. These figures show that the words 
substituted for the correct words are by no means randomly selected. 
Words in the experimental series were so used one thousand six hundred 
times. Considering all the data, pleasant words were so used six hundred 
twenty times, unpleasant words five hundred sixty-nine times, and 
indifferent words four hundred eleven times. These results should 
be considered in relation to the nature of the words the children were 
attempting to recall. The correct responses would have been most 
often indifferent, next most often unpleasant, and least often pleasant. 
The words used incorrectly thus do not in general represent the same 
categories which would furnish correct responses in those instances. 
The numerical data lead directly to the inference that there is some 
tendency to substitute incorrect pleasant words for correct unpleasant or 
indifferent words, in the initial stages of learning. 
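The chance expectation mentioned above (each category used equally often as a false response) can be made concrete. The following is a hedged sketch: the chi-square goodness-of-fit statistic is not part of the original report, and it treats the 1600 substitutions as independent observations, which they are not strictly, since the same children contributed many responses. The function name is an assumption.

```python
# Column 4 of Table I: how often words of each category were given
# as false responses, summed over the three series.
used_as_false = {"pleasant": 620, "indifferent": 411, "unpleasant": 569}


def chi_square_equal_use(counts):
    """Expected count under equal use, and the chi-square statistic."""
    total = sum(counts.values())
    expected = total / len(counts)
    stat = sum((obs - expected) ** 2 / expected for obs in counts.values())
    return expected, stat


expected, stat = chi_square_equal_use(used_as_false)
print(round(expected, 1), round(stat, 2))
```

With two degrees of freedom, a statistic of this size would ordinarily be read as a clear departure from equal use, subject to the independence caveat above.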


STATISTICAL SIGNIFICANCE OF THE DIFFERENCES 


In an earlier article* dealing with the data of the first series of 
tests, it was shown that the differences in learning were statistically 
reliable. The earlier study was based upon learning scores secured 
by summation of correct responses to words in each of the three cate- 
gories. From this it follows that the differences between frequencies 
of errors for the three categories must also be significant, since for 
each child the total number of errors plus the total number of correct 
responses is a constant. 
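The argument rests on a simple identity: with eight words per category and five recall trials per word, each child has forty opportunities per category, so errors are forty minus correct responses and any difference in correct responses reappears with opposite sign in the errors. A minimal sketch follows; the scores shown are hypothetical and are not taken from the study.

```python
TRIALS = 5                      # recall trials per word (see THE DATA)
WORDS_PER_CATEGORY = 8
RESPONSES = TRIALS * WORDS_PER_CATEGORY   # 40 opportunities per category


def errors_from_correct(correct):
    """Convert correct-response scores to error counts for one child."""
    return {cat: RESPONSES - n for cat, n in correct.items()}


# Hypothetical learning scores for one child.
correct = {"pleasant": 31, "indifferent": 25, "unpleasant": 27}
errors = errors_from_correct(correct)

# A difference in correct responses equals the reversed difference in errors.
diff_correct = correct["pleasant"] - correct["indifferent"]
diff_errors = errors["indifferent"] - errors["pleasant"]
print(errors, diff_correct, diff_errors)
```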

The present study of errors has been extended to include two 
additional series of data. The three series of data have been in 
essential agreement, hence we can infer that the differences here, 
based upon the total mass of data, are likewise significant. To avoid 
duplication, the proof is not presented here, but as a matter of fact 
the critical ratios were all found to be larger when all three series of 
data were included in the analysis. 


ASSOCIATION TIME DATA 


Association times are often used as indications of emotional 
connotations of words. Assuming that words which stimulate com- 
plexes, especially unpleasant ones, tend to have long association 
times, we should expect the “unpleasant” words to have longer 
association times. In general, this is the case, as is shown in Table I. 
In association times there is not much difference between the pleasant 
and indifferent groups of words. The words within each category 
were rather variable with respect to association times—a fact which 
suggested further sub-division according to this second criterion. 
For this purpose, the eight words in each category (each series of data 
was treated separately) were divided into two groups: The first group 
consisted of the four words with association times longer than the 
median; the second group consisted of the four words with association 
times shorter than the median. The summary of results in Table II, 
for the words of Series I, II, and III, permits both inter- and intra- 
category comparisons. 
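The median split described above can be sketched as follows. The per-word association times are not given in the text, so the values below are hypothetical; only the word list (the Series I pleasant words) comes from the report, and the function name is an assumption.

```python
from statistics import median


def median_split(times):
    """Split words into long and short groups around the median time."""
    cut = median(times.values())
    long_group = sorted(w for w, t in times.items() if t > cut)
    short_group = sorted(w for w, t in times.items() if t < cut)
    return long_group, short_group


# Hypothetical mean association times in seconds.
example_times = {"mother": 3.1, "father": 3.4, "candy": 3.6, "movies": 3.9,
                 "love": 5.2, "marry": 5.6, "dance": 4.9, "kiss": 6.3}
long_words, short_words = median_split(example_times)
print(long_words, short_words)
```

With eight distinct times the median falls between the fourth and fifth values, giving four words per group; a word tied exactly at the median would fall in neither group and would need a separate rule.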

The results indicate no systematic relationship between associa- 
tion times and errors, apart from that relationship which is indicated 
also by the pleasantness-unpleasantness ratings. There is some 
slight evidence that pleasant and indifferent words are less efficiently 


TABLE II. SUMMARY OF DATA REARRANGED TO SHOW TRENDS OF ERRORS IN 
RELATION TO ASSOCIATION TIMES 

                              Words with short association times        Words with long association times
                              Mean          Used in       Sum of        Mean          Used in       Sum of
                              association   place of      errors        association   place of      errors
                              time          correct                     time          correct
                                            words                                     words

“Pleasant” words.
  Series I                    3.50          181           426           5.50          172           540
  Series II                   2.92          65            276           4.06          102           384
  Series III                  2.72          56            300           3.84          44            274
  Average or total*           3.05          302           1002          4.47          318           1198
“Indifferent” words.
  Series I                    3.72          95            678           5.32          114           798
  Series II                   3.04          56            421           3.88          57            397
  Series III                  3.02          44            372           4.20          45            440
  Average or total            3.26          195           1471          4.47          216           1635
“Unpleasant” words.
  Series I                    4.22          196           583           6.11          119           642
  Series II                   3.52          45            442           4.54          85            390
  Series III                  3.21          57            375           4.04          67            292
  Average or total            3.65          298           1400          4.89          271           1324
All words.
  Series I                    3.81          472           1687          5.64          405           1980
  Series II                   3.16          166           1139          4.16          244           1171
  Series III                  2.98          157           1047          4.03          156           1006
  Average or total            3.32          795           3873          4.61          805           4157

* The means and totals, properly combined, may be compared with corresponding values in 
Table I. 
learned when the association times are long, but this finding does not 
apply to the unpleasant words. Within certain limits, as shown by 
inter-category comparisons, the association times furnish evidence 
in support of the ratings. But whatever the association times meas- 
ure independently of the ratings, it appears not to be covariant with 
the differential error trends. When the words are paired to hold 
subjectively-rated affective factors constant, the association times 
become unimportant in relation to the error trends. 
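The note to Table II says that its entries, properly combined, agree with Table I. The sketch below performs that combination and also computes the intra-category difference in errors (long minus short) on which the paragraph above relies. It assumes the table values are transcribed correctly; all names are illustrative.

```python
# (used in place of correct words, sum of errors) for Series I, II, III.
SHORT = {
    "pleasant":    [(181, 426), (65, 276), (56, 300)],
    "indifferent": [(95, 678), (56, 421), (44, 372)],
    "unpleasant":  [(196, 583), (45, 442), (57, 375)],
}
LONG = {
    "pleasant":    [(172, 540), (102, 384), (44, 274)],
    "indifferent": [(114, 798), (57, 397), (45, 440)],
    "unpleasant":  [(119, 642), (85, 390), (67, 292)],
}
# Columns 4 and 7 of Table I, for comparison.
TABLE_I = {
    "pleasant":    [(353, 966), (167, 660), (100, 574)],
    "indifferent": [(209, 1476), (113, 818), (89, 812)],
    "unpleasant":  [(315, 1225), (130, 832), (124, 667)],
}


def combine(short, long):
    """Add the short-time and long-time halves, series by series."""
    return {cat: [(su + lu, se + le)
                  for (su, se), (lu, le) in zip(short[cat], long[cat])]
            for cat in short}


def error_gap(short, long):
    """Total errors on long-time words minus total on short-time words."""
    return {cat: sum(e for _, e in long[cat]) - sum(e for _, e in short[cat])
            for cat in short}


combined = combine(SHORT, LONG)
gap = error_gap(SHORT, LONG)
print(combined == TABLE_I, gap)
```

The gap is positive for pleasant and indifferent words and negative for unpleasant words, matching the statement that long association times go with somewhat poorer learning only in the first two categories.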


INTERPRETATION 


In considering the results reported in this paper, we must content 
ourselves with limited generalizations. We have been largely con- 
cerned with establishing basic facts, and setting up principles of 
experimentation which will lead to more understandable results. 
The study has been planned in such a way that the limitations and 
weaknesses in such data are revealed, as well as the significant trends. 
It appears to us that the method of cumulative extension of mass of 
material gives a firmer basis for inference as to the true cause of the 
findings. 

First, we recognize that the selection of words is very important. 
Differences of the sort here demonstrated could conceivably arise 
from other causes than the emotional attributes of the words. This 
argument has frequently been raised, and it must be admitted as a 
possibility. It has been suggested, for example, that other attributes 
of the words, such as familiarity, may explain the differences. This 
argument loses force when applied to the present data, because the 
three classes of words are specifically chosen to be different in emo- 
tionality, while differences in familiarity are kept minimal. When 
words are so chosen, and the list is cumulatively extended, it is highly 
unlikely that by chance differences in familiarity or other aspects of 
the words will happen to apply in the same way to all the selections 
of words. 

There may be other attributes of words, inevitably associated 
with pleasantness and unpleasantness, which indirectly bring about 
differences in efficiency of learning. For example, unpleasant material 
may be avoided in many phases of experience. This might lead to 
a situation in which unpleasant material, extensively sampled, is 
always less familiar than pleasant material. But such unfamiliarity 
would result from the unpleasantness, hence the interpretation of such 
a finding, even if it were demonstrated, would not be the one implied 
in the current literature. 

A second fundamental consideration has to do with the nature 
of the task performed by the children. Errors, slips, and mistakes 
of various kinds are not made when material is very efficiently handled. 
Hence tasks which are too easy are not suitable for study of factors 
affecting efficiency. The recognition method suffers from this defect, 
unless care is taken to introduce difficulty through some kind of com- 
plication or extension of material. But in discussing the suitability 
of the task, we must also consider just what the subject does. Pre- 
sumably, the theory of better learning of pleasant words and easier 
forgetting of unpleasant ones is more adequately tested by a task 
in which the subject is required to recall the specific emotionally- 
toned words. Probably extensive researches would show agreement 
in trends of results secured by a wide variety of learning methods, 
but when studies are limited in extent and scope, such agreements may 
be obscured by differences dependent upon the particular procedures 
used. 

The present data do not furnish direct evidence on emotional 
factors in delayed memory, since only immediate recall has been tested. 
However, the fact that memory is so closely related to adequacy of 
original learning leads to the inference that similar trends would be 
shown in delayed-recall tests. The trends might be more pronounced 
in delayed recall, as a result of the decreased efficiency of performance. 
It would be theoretically interesting if adequate studies could demon- 
strate the operation of emotional factors in delayed recall, when they 
are absent in immediate recall. 

In further work, it is planned to investigate the relationships 
after further classification of the words with respect to their galvano- 
metric potency. The average galvanometer deflection caused in 
subjects by each word would furnish additional evidence of emotional 
stimulus-value of the words.* However, the difficult requirements 
of an adequate study of this sort are often overlooked. We have 
repeatedly noted the need for comparison of scores based upon suitably 
large samplings of words. The further analysis, which would involve 





* It should be mentioned that one can confidently expect to secure galvanometer 
deflections which are in general larger for the “unpleasant” words than for 
the “pleasant” words, etc. In an earlier publication® it was shown that galvanometer 
deflections were largest for the unpleasant words, second largest for pleasant 
words, and smallest for indifferent words, using the data of Series I. 








66 The Journal of Educational Psychology 


some subdivision of the material, is outside the scope of the present 
report, and will be presented in another article. This further analysis 
should give our study added significance in relation to the theories 
advanced by Whately Smith,® and supported by the more adequately- 
controlled experiment by Harold Jones.*® 


SUMMARY AND CONCLUSIONS 


A cumulative study has been carried out, bearing on the theory 
that learning is more efficient for pleasant than for unpleasant material. 
A group of about one hundred children learned selections of pleasant, 
indifferent, and unpleasant words, at three different times, with inter- 
vening intervals of six months. The task required the recall of the 
words, in response to pictures presented as paired associates. Over 
eight thousand errors made in the process of learning were tabulated 
and analyzed. These data are supplemented by the earlier published 
analysis of children’s scores based upon number of correct responses. 
The data indicate the following tentative conclusions: 

1. The general trends of results obtained in repetitions of the 
experiment were in essential agreement. However, the variability 
in efficiency of recall for different words within the categories suggests 
that for purposes of the present argument comparisons must be based 
upon larger groups of words, rather than upon single words. 

2. Pleasant words are better learned than are either indifferent 
or unpleasant words. This finding is statistically reliable, and it is 
shown in each of the three series of data. 

3. Unpleasant words tend to be better learned than indifferent 
words. In one series of data this finding is reversed, but the total 
mass of experimental data shows a statistically reliable difference in 
favor of the unpleasant words, as compared with the indifferent. 

4. Inter-category comparisons show that association times will 
furnish independent support for the pleasantness-unpleasantness rat- 
ings as indications of factors affecting learning. Intracategory com- 
parisons show that association times become unimportant in relation 
to learning when the subjectively-rated emotional value of the words 
is kept constant. 

5. The errors made in learning show interesting trends in relation 
to affective tone of words. Considering all the results, most errors 
are made in learning indifferent words, and fewest in learning pleasant 
words. In false responses, there is a tendency for the child to replace 
the correct unpleasant or indifferent word with an incorrect pleasant 
one. 


VALIDITY OF TEST ITEMS 


FRANCES SWINEFORD 


Research Assistant, University of Chicago 


INTRODUCTION 


One view of the validity of a test item is its correlation with some 
valid criterion which is independent of the test. If the test as a whole, 
however, is known to be valid, then the rating of any item may be 
measured by its correlation with the test of which it is an element. 
Such a correlation may be thought of not only as a measure of the 
consistency between the item and the test total, but also as the validity 
of the item, since it shows the extent to which the item measures what- 
ever function the test as a whole measures. 

The purpose of this study¹ is to compare eight selected methods 
of determining test-item validity, and any satisfactory criterion, 
independent or internal, will suffice. The internal criterion has been 
adopted, because the whole test upon which the study is based is the 
best measure of Spearman’s “g” yet available. The eight methods 
will be compared for general agreement with each other, for reliability, 
for applicability to different test situations, and for ease of computation. 


MATERIAL 


The test material consists of three non-language tests of Spearman’s 
“g”: (1) Spearman’s Visual Perception Test, Part III, (2) Test 
of Abstract Relations, and (3) Perceptual Analogies,² which were treated 
as one long test of two hundred eighty-four items. They were 
administered to a large group (approximately seven hundred) in the 
Mooseheart schools during the Spring of 1933. The papers were 
examined for completeness, the group of one hundred forty-two cases 
finally selected including only those who had attempted the maximum 
number of items. 

Owing to the fact that the items cover a wide difficulty range and 
that the test consists of a number of timed parts, a large number of 
omissions occur among the items. It was arbitrarily decided that, 





¹ For a more detailed report, see Swineford, Frances: A Comparison of Methods 
of Evaluating Test Items, Unpublished Master’s thesis, Department of Education, 
University of Chicago, 1935. 

² Unpublished. These tests were used in the Spearman-Holzinger unitary trait 
study, and were made available to the writer through the generosity of Professor 
Holzinger, to whom she is greatly indebted. 
although all the two hundred eighty-four items contribute to the 
criterion score, only those items attempted by at least one hundred 
thirty-two pupils should be analyzed in detail. It was further 
decided that the range of the difficulty of the items should be limited 
to a minimum of five per cent and a maximum of ninety-five per cent. 
A preliminary analysis clearly indicated that the elimination of three 
extremely invalid items would add to the reliability of the conclusions 
drawn concerning the various methods. The set of items finally 
selected numbers one hundred ninety-one, all omissions among which 
are treated as errors. 


METHODS 


The test material contains only items which may be scored dichoto- 
mously. The eight methods selected to measure the validity of such 
items will next be described. The notations have been made uniform 
and hence do not always agree with those of the authors. The symbols 
most commonly used are as follows: 


V = validity rating for an item. 

N = number of cases. 

R = number of right responses for an item. 

W = number of wrong responses for an item (R + W = N). 

p = R/N. 

q = W/N. 

$M_R$ = mean criterion score of all pupils answering item correctly. 

$M_W$ = mean criterion score of all pupils answering item incorrectly. 


Index 1. Bi-serial r. 

$$V_1 = \frac{(M_R - M_W)\,pq}{\sigma_t z}$$

where $\sigma_t$ is the standard deviation of the distribution of criterion 
scores, and z is the ordinate which divides the unit area under the 
normal curve into two parts equal to p and q respectively. Bi-serial r 
is considered to give the best estimate of the rating of an item because 
each criterion score is given due weight and is not suppressed by 
grouping the scores into broad categories. The use of the formula 
is justified to the extent that the data satisfy two conditions, viz., 
the distribution of the dichotomous variable is normal and the regression 
in the table is linear. 
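As an illustration only, Index 1 can be computed with the Python standard library. This is a minimal sketch under stated assumptions: the criterion standard deviation is taken as the population value (`pstdev`), the ordinate z is the standard normal density at the point cutting off the proportion p, and the function name `biserial_r` and the toy data are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def biserial_r(item_right, criterion):
    """Index 1: (M_R - M_W) * p * q / (sigma * z) for one dichotomous item.

    item_right holds 1 (right) or 0 (wrong) per pupil; criterion holds the
    matching criterion scores. Requires at least one right and one wrong.
    """
    right = [c for r, c in zip(item_right, criterion) if r]
    wrong = [c for r, c in zip(item_right, criterion) if not r]
    p = len(right) / len(criterion)
    q = 1 - p
    std_normal = NormalDist()
    # Ordinate of the normal curve at the point dividing the area into p and q.
    z = std_normal.pdf(std_normal.inv_cdf(p))
    sigma = pstdev(criterion)   # assumption: population standard deviation
    return (mean(right) - mean(wrong)) * p * q / (sigma * z)


# Toy data: the two highest scorers pass the item, the two lowest fail it.
v1 = biserial_r([1, 1, 0, 0], [3, 4, 1, 2])
# An item unrelated to the criterion gives zero.
v1_null = biserial_r([1, 0, 1, 0], [2, 2, 4, 4])
print(v1, v1_null)
```

With very small or non-normal samples the value can exceed 1, which is one reason the two conditions stated above matter.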

An important disadvantage in the use of Index 1 lies in the complexity 
of the computations involved. This disadvantage was overcome 
in part through the use of tables of p (and q), pq, z, and $\sigma_t z$. 
Such tables are impractical, however, unless the number of items 
exceeds the number of cases. 
Index 2. McCall.¹ 

$$V_2 = \frac{(M_R - M_W) \times RW}{N}$$

This method was developed by McCall and his students in connection 
with the construction of the McCall Multi-Mental Scale. The 
author’s general formula, of which $V_2$ is a special case, has the advantage 
of being applicable to multiple-response items. Its calculation 
is somewhat simpler than that of Bi-serial r. 

Index 3. Holzinger.* 

$$V_3 = \frac{(R_u + W_l) - (W_u + R_l)}{\tfrac{1}{2}N}$$

where $R_u$ = number of “rights” in upper twenty-five per cent of 
total group. 
$W_u$ = number of “wrongs” in upper twenty-five per cent of 
total group. 
$R_l$ = number of “rights” in lower twenty-five per cent of 
total group. 
$W_l$ = number of “wrongs” in lower twenty-five per cent of 
total group. 
$V_3$ can have any value between -1 and +1. A perfectly valid item 
by this formula is one passed by all of the upper group and failed by 
all of the lower group and will therefore be rated +1. An item 
which has no discriminating power is one passed and failed by the 
same number of pupils at each level and will be rated zero. The 
formula does not take into account variations among the middle 
fifty per cent of the cases. Further, it penalizes items whose difficulty 
lies outside the range of (25-75) per cent. 

For the present data the denominator may be disregarded, since 
it is the same for every test item. A further simplification results 
from the relationship, $R_l + W_l = R_u + W_u = .25N$. When the 
W’s are expressed in terms of the R’s and N, the numerator becomes 
$2(R_u - R_l)$, which may be divided by two. 
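Indices 2 and 3 can be sketched in a few lines. The code below is an illustration under assumptions, not the authors' computing routine: N is taken to be divisible by four, ties in criterion score are broken by input order, and the function names and toy data are hypothetical. It also checks the simplification noted in the text, that the numerator of Index 3 reduces to twice the difference between upper-group and lower-group rights.

```python
from statistics import mean


def mccall_index(item_right, criterion):
    """Index 2: (M_R - M_W) * R * W / N."""
    right = [c for r, c in zip(item_right, criterion) if r]
    wrong = [c for r, c in zip(item_right, criterion) if not r]
    return (mean(right) - mean(wrong)) * len(right) * len(wrong) / len(criterion)


def quarter_counts(item_right, criterion):
    """Rights in the upper and lower quarters of the criterion ranking."""
    n = len(criterion)
    k = n // 4                               # assumes N divisible by four
    order = sorted(range(n), key=lambda i: criterion[i])
    r_l = sum(item_right[i] for i in order[:k])
    r_u = sum(item_right[i] for i in order[-k:])
    return r_u, r_l, k


def holzinger_index(item_right, criterion):
    """Index 3: ((R_u + W_l) - (W_u + R_l)) / (N / 2)."""
    r_u, r_l, k = quarter_counts(item_right, criterion)
    w_u, w_l = k - r_u, k - r_l
    return ((r_u + w_l) - (w_u + r_l)) / (0.5 * len(criterion))


# Toy data: eight pupils ranked by criterion score, 1 = item answered right.
item = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [1, 2, 3, 4, 5, 6, 7, 8]
v2 = mccall_index(item, scores)
v3 = holzinger_index(item, scores)
r_u, r_l, k = quarter_counts(item, scores)
print(v2, v3, 2 * (r_u - r_l))
```

In this toy case every pupil in the upper quarter passes and every pupil in the lower quarter fails, so Index 3 reaches its maximum.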











¹ McCall, Wm. A. and Students: “Construction of Multi-Mental Scale.” 
Teachers College Record, Vol. XXVII (January, 1926), pp. 403. 

* Traxler, Arthur Edwin: The Measurement and Improvement of Silent Reading 
at the Junior-High-School Level. Unpublished Doctor’s thesis, Department of 
Education, University of Chicago, 1932, pp. 87-89. 
Index 4. Upper Minus Lower Thirds.¹ 

$$V_4 = R_u - R_l$$

where $R_u$ = number of “rights” in upper thirty-three per cent of 
total group. 
$R_l$ = number of “rights” in lower thirty-three per cent of 
total group. 
$V_4$ can vary from -N/3 to +N/3. It is similar to $V_3$ in its limitations. 
That these limitations are significant and should be taken into account 
will be shown in a later paragraph. 
Index 5. Overlapping.²

$$V_5 = \text{per cent } W > M_R$$

Vincent used the median in place of the mean, which has been
selected here. The use of the more reliable average is consistent with
our effort to discover the best method of measuring validity. It will
be noted that the value of $V_5$ decreases with increasing validity. A
value of fifty per cent indicates no discriminating power and
corresponds to a rating of zero by any of the foregoing indices.
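As a rough illustration of Index 5, the following Python sketch computes the per cent of pupils failing the item whose total scores exceed the mean total score of the pupils passing it. The argument names are hypothetical, and the reading of the formula as "per cent of W above the mean of R" is taken from the description above.

```python
from statistics import mean

def overlap_index(item_right, total_scores):
    """Per cent of 'wrongs' whose total score exceeds the mean of the 'rights'."""
    rights = [s for r, s in zip(item_right, total_scores) if r]
    wrongs = [s for r, s in zip(item_right, total_scores) if not r]
    m_r = mean(rights)  # mean total score of pupils passing the item
    above = sum(1 for s in wrongs if s > m_r)
    return 100.0 * above / len(wrongs)
```

With two "rights" scoring 10 and 6 and two "wrongs" scoring 9 and 3, the mean of the "rights" is 8, one of the two "wrongs" exceeds it, and the function returns 50.0.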
Index 6. Clark.³

$$V_6 = \frac{P - q}{1 - q}$$

where $P$ is the proportion of "wrongs" among the $W$ lowest-scoring
individuals of the entire group. The labor of calculation may be
considerably simplified if the formula be rewritten as follows:

$$V_6 = \frac{W'N - W^2}{WN - W^2}$$

where $W'$ is the number of "wrongs" among the $W$ lowest-scoring
individuals. A table of $WN - W^2$ may be made for all possible values
of $W$.
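The equivalence of the two forms of Index 6 can be checked numerically. The sketch below assumes that $P = W'/W$ and that the proportion of "wrongs" in the whole group is $W/N$ (written `q` here, a symbol assumed for illustration); the argument names are hypothetical, and the item must be failed by at least one pupil and passed by at least one.

```python
def clark_index(item_right, total_scores):
    """Return Index 6 computed by the ratio form and by the count form."""
    n = len(total_scores)
    w = sum(1 for r in item_right if not r)  # number of "wrongs"
    # The W lowest-scoring pupils of the entire group.
    lowest = sorted(range(n), key=lambda i: total_scores[i])[:w]
    w_prime = sum(1 for i in lowest if not item_right[i])
    p_cap = w_prime / w  # proportion of "wrongs" among those W pupils
    q = w / n            # assumed: proportion of "wrongs" in the whole group
    ratio_form = (p_cap - q) / (1 - q)
    count_form = (w_prime * n - w**2) / (w * n - w**2)
    return ratio_form, count_form
```

For eight pupils with W = 3 and W' = 2, both forms give 7/15.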


Index 7. Difference between Means.

$$V_7 = M_R - M_W$$


The possibility of using the difference between the means as a 
measure of the validity of an item suggested itself to the writer while 


1 Lentz, Theo. F. Jr., Hirshstein, Bertha and Finch, J. H.: "Evaluation of
Methods of Evaluating Test Items." Journal of Educational Psychology, Vol.
XXIII (May, 1932), pp. 344-350.

2 Vincent, Leona: A Study of Intelligence Test Elements, pp. 8-13. Teachers
College Contributions to Education, No. 152. New York: Teachers College,
Columbia University, 1924.

3 Clark, E. L.: "A Method of Evaluating the Units of a Test." Journal of
Educational Psychology, Vol. XIX (April, 1928), pp. 263-265.
the calculations for Indices 1 and 2 were in progress. It has the
advantage of being more easily computed than either of these indices.
Index 8. Zubin.¹

$$V_8 = \frac{M_R - M_W}{\sigma_{(M_R - M_W)}}$$

Zubin's index is the ratio commonly used to denote the significance
of a difference, and is therefore a more refined measure than Index 7.
He recommends, however, that the numerator be $M_R - M_W - 1$
in order to eliminate the effect of the item under consideration. The
differences between the ratings by the two formulae were found to be
insignificant for the present data and the correlation between them
was about .99. For this reason the shorter form has been selected.
It may be noted in passing that the author has rewritten the longer
form for purposes of computation, but the writer found the original
form better adapted to routine methods of calculation.²
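Indices 7 and 8 may be sketched together in Python. The text does not give the formula used for the standard error of the difference, so the sketch assumes the common large-sample form, the square root of $\sigma_R^2/n_R + \sigma_W^2/n_W$ with population variances. That choice, like the argument names, is an assumption made for illustration, and it requires some spread of scores in at least one of the two groups.

```python
from math import sqrt
from statistics import mean, pvariance

def mean_difference_indices(item_right, total_scores):
    """Return (V_7, V_8): difference of means and its critical ratio."""
    rights = [s for r, s in zip(item_right, total_scores) if r]
    wrongs = [s for r, s in zip(item_right, total_scores) if not r]
    v7 = mean(rights) - mean(wrongs)
    # Assumed standard error of the difference between two independent means.
    se = sqrt(pvariance(rights) / len(rights) + pvariance(wrongs) / len(wrongs))
    return v7, v7 / se
```

The shorter form of the numerator is used here, as in the text. Subtracting 1 from `v7` before dividing would give the longer form recommended by Zubin.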

Each of the one hundred ninety-one test items has been evaluated 
by each of the foregoing methods. One additional characteristic 
of a test item remains to be defined. An item is said to have perfect 
Balance if it is answered correctly by exactly fifty per cent of the cases. 
As the difficulty increases or decreases from this point the Balance 
decreases. In other words, an item of twenty per cent difficulty and
an item of eighty per cent difficulty have the same Balance, which is
recorded as .20.
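The definition of Balance amounts to taking the smaller of the proportion passing and the proportion failing. A minimal Python sketch, with a hypothetical function name and the proportion expressed as a decimal fraction:

```python
def balance(proportion_right):
    """Balance of an item: the smaller of the proportions right and wrong."""
    return min(proportion_right, 1.0 - proportion_right)
```

Items passed by twenty and by eighty per cent of the cases both receive a Balance of .20, and an item passed by exactly half receives the maximum, .50.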


ANALYSIS OF RESULTS 


The raw intercorrelations among the eight indices are presented 
in Table I. As already indicated, the obtained coefficients of correla- 
tion between Index 5 and each of the other indices are negative. The 
relationship, however, is positive, and the minus signs have been 





1 Zubin, Joseph: "The Method of Internal Consistency for Selecting Test
Items." Journal of Educational Psychology, Vol. XXV (May, 1934), pp. 345-356.
2 Zubin gives the rewritten formula as follows: 


= (dp — q) VN 
Nol 03 + 1? — pd 


where d, = Mr — M; M = mean for entire group; ¢, = o forentire group. This 
formula is in error and should read, 








For the proof of this correction, see F. Swineford, op. cit. 
TABLE I. RAW INTERCORRELATIONS OF VALIDITY RATINGS

Index      1        2        3        4        5        6        7        8
  1
  2      .8043
  3      .7485-   .9675+
  4      .7543    .9645+   .9560
  5      .8954    .6732    .6335-   .6656
  6      .8781    .7731    .7372    .7441    .7508
  7      .9681    .6523    .5941    .6009    .8943    .8310
  8      .9419    .8995-   .8622    .8770    .8560    .8371    .8684
  B      .0055-   .5188    .5475+   .5322   -.0882    .0996   -.1830    .2023





omitted for the sake of clearness. The coefficients in general are high, 
ranging in value from .59 to .97. Indices 2, 3, and 4 show significant 
correlations with Balance, while the remaining correlations with 
Balance are of doubtful or no significance. 

Certain of these coefficients are directly comparable with results
obtained by other investigators. Lindquist and Cook¹ have compared
five methods of validating the words in a spelling test, three of which
appear in this report as Index 1, Index 3, and Index 6. Their formula
which corresponds to Index 3 is $V = (U - L)/[\text{illegible}]$, where $U$ and $L$ equal
the per cent "rights" among the upper and lower twenty-five per
cent of the pupils, respectively. Thus, although $V_3$ must be multiplied
by a constant [illegible] in order to be equal to the $V$ of
Lindquist and Cook, the correlations are not affected by the constant
multiplier. Another writer² gives data for Indices 2, 4, and 5, and
Balance. Long and Sandiford³ have recently reported on an
investigation covering methods of rating test items. The report includes data
from a similar study by L. J. Henry. Their inter-correlations for
Indices 1, 4, 5, 6, and 7 and Balance may be compared with the data
of Table I. Although they used the median instead of the mean in
Index 5, this variation in method may not be sufficient to affect the
results significantly. All the coefficients have been brought together
for comparison in Table II.

The thirty-three critical ratios for the differences between the 
entries in the first column of the table and those in the other columns 





1 Lindquist, E. F. and Cook, W. W.: "Experimental Procedures in Test
Evaluation." Journal of Experimental Education, Vol. I (March, 1933), pp. 163-185.

2 Hirshstein, Bertha T.: Evaluation of Methods of Evaluating Character Test
Items. Unpublished Master's thesis, Washington University, 1930. Pp. 34.

3 Long, John A. and Sandiford, Peter: The Validation of Test Items. Bulletin
No. 3 of the Department of Educational Research, University of Toronto, 1935.










TABLE II. COMPARISON OF CORRELATIONS AMONG VALIDITY RATINGS WITH
THOSE OBTAINED BY OTHER INVESTIGATORS

          Swineford    Lindquist     Hirshstein    Henry        Long and
          (191         and Cook      (150          (110         Sandiford
Indices   items)       (50 items)    items)        items)       (100 items)
  13       .748         .425          ...           ...          ...
  14       .754         ...           ...           .60          .65
  15       .895         ...           ...           .88          .87
  16       .878         [illegible]   ...           .77          .80
  17       .968         ...           ...           .95          ...
  24       .965         ...           .804          ...          ...
  25       .673         ...           .674          ...          ...
  36       .737         [illegible]   ...           ...          ...
  45       .666         ...           .72           .42          .43
  46       .744         ...           ...           .82          .66
  47       .601         ...           ...           .41          ...
  56       .751         ...           ...           .61          .21
  57       .894         ...           ...           .93          ...
  67       .831         ...           ...           .62          ...
  1B       .005         ...           ...           .105         .04
  4B       .532         ...           .483          .558         .62
  5B      -.088         ...          -.005         -.354         .07
  6B       .100         ...           ...           .199         .27
  7B      -.183         ...           ...          -.385         ...




















have been computed. Of these thirty-three ratios, only four are as 
great as 4.00, viz., those for res, ras (Henry), 7s6 (Long and Sandiford), 
and 77. No explanation is apparent for these disparities. 

An examination of Table I shows Indices 2, 3, and 4 to be highly 
intercorrelated and to be correlated to a lesser degree with the remain- 
ing indices. Since the ratings by these three methods are affected 
by Balance, it was decided to employ the partial correlation technique 
in order to eliminate the effect of Balance. The resulting coefficients, 
given in Table III, range from .77 to .99. Thus when Balance is 
held constant there is not much choice among the indices on the 
basis of their intercorrelations. 
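The elimination of Balance can be illustrated with the usual first-order partial correlation formula, $r_{xy.B} = (r_{xy} - r_{xB}r_{yB}) / \sqrt{(1 - r_{xB}^2)(1 - r_{yB}^2)}$. The text says only that "the partial correlation technique" was employed, so the use of this particular formula is an assumption. Applied to the raw coefficients of Table I, it reproduces the corresponding entries of Table III to about four decimals.

```python
from math import sqrt

def partial_r(r_xy, r_xb, r_yb):
    """Correlation of x and y with a third variable B held constant."""
    return (r_xy - r_xb * r_yb) / sqrt((1 - r_xb**2) * (1 - r_yb**2))

# Raw coefficients taken from Table I (B = Balance).
r_13_b = partial_r(0.7485, 0.0055, 0.5475)  # Indices 1 and 3
r_48_b = partial_r(0.8770, 0.5322, 0.2023)  # Indices 4 and 8
```

The two results are approximately .8909 and .9279, which agree with the tabled partial coefficients for Indices 1 and 3 and for Indices 4 and 8.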

In the paragraphs describing Indices 3 and 4 it was pointed out
that these formulae do not adequately rate either extremely easy or
extremely difficult items. The fact having been verified that Balance
is a significant factor, it is interesting to note its effect in terms of the
means of the validity ratings at successive levels of difficulty. An
interval of twenty per cent was selected, the successive groups
overlapping in order that eight groups might be obtained, each with enough
items so that reliable results might be expected. One additional
interval was included, that whose mid-point is fifty per cent. The
difficulty-groups, together with the number of items within each, are
presented in Table IV.


TABLE III. PARTIAL CORRELATIONS OF VALIDITY RATINGS WITH BALANCE
ELIMINATED FROM RAW CORRELATIONS

Index      1        2        3        4        5        6        7
  1
  2      .9374
  3      .8909    .9554
  4      .8875+   .9511    .9381
  5      .8994    .8443    .8180    .8449
  6      .8820    .8481    .8199    .8204    .7664
  7      .9858    .8890    .8440    .8390    .8968    .8681
  8      .9607    .9490    .9169    .9279    .8957    .8384    .9404








The mean of each index has been found for each difficulty-group. 
In order to make them directly comparable, they have been expressed 
as deviations from the mean for the total set of one hundred ninety-one 
items in terms of the total standard deviation. These values appear 
in Table V. The greater variation for Indices 2, 3, and 4 is quite 
apparent from the table. 

It has been stated that the eight indices would be compared on 
the basis of their reliability. For this purpose the one hundred forty- 
two total scores were ranked and divided into two groups, the division 
being made in the following manner (the numbers refer to the ranks): 


Group A     Group B
   1           2
   4           3
   5           6
   8           7
   9          10
  etc.        etc.


The one hundred ninety-one test items were then rated by all 
the methods for each group. The reliability of a method may be 
estimated by correlating the values for Group A with those for Group 
B. The resulting coefficient will, of course, be affected by unreliability 
of grouping and therefore will not render an accurate measure of the 
reliability of the method. Since the grouping is identical for all 


methods, however, it will be possible to draw conclusions concerning 
their relative stability. 
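The alternating division of ranks shown above (1, 4, 5, 8, 9, ... against 2, 3, 6, 7, 10, ...) can be generated with a short Python sketch. The rule used here, based on the remainder of the rank on division by four, is inferred from the ranks listed and is not stated in the text.

```python
def split_ranks(n):
    """Divide ranks 1..n into Groups A and B in the pattern A B B A."""
    group_a = [r for r in range(1, n + 1) if r % 4 in (1, 0)]
    group_b = [r for r in range(1, n + 1) if r % 4 in (2, 3)]
    return group_a, group_b
```

For the first ten ranks this returns `[1, 4, 5, 8, 9]` and `[2, 3, 6, 7, 10]`, and for one hundred forty-two ranks each group contains seventy-one cases.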





Accordingly, the reliability coefficients for each method at successive
levels of difficulty, as well as on the basis of the total set of items,
have been computed (Table VI). Each has been "stepped up"


TABLE IV. NUMBER OF TEST ITEMS IN EACH OF NINE SUCCESSIVE DIFFICULTY
GROUPS

DIFFICULTY      NUMBER
(Per Cent)      OF ITEMS
  75-95            21
  65-85            32
  55-75            38
  45-65            45
  40-60            49
  35-55            51
  25-45            59
  15-35            56
   5-25            44


by the Spearman-Brown formula to give the value for one hundred 
forty-two cases. Index 6 is apparently the most erratic measure. 
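The "stepping up" may be sketched with the Spearman-Brown formula in its general form, $r' = kr / (1 + (k - 1)r)$, where $k$ is the factor by which the number of cases is increased (here 2, from seventy-one to one hundred forty-two). The function name and the illustrative coefficient are hypothetical. The entries of Table VI are understood to be already stepped up.

```python
def spearman_brown(r_half, factor=2.0):
    """Step up a coefficient obtained on a fraction of the cases."""
    return factor * r_half / (1.0 + (factor - 1.0) * r_half)
```

A half-group coefficient of .50, for example, steps up to .667.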


TABLE V. DEVIATIONS OF MEANS FROM THE TOTAL MEAN IN TERMS OF THE
TOTAL STANDARD DEVIATION, COMPUTED FOR EACH INDEX AT SUCCESSIVE
LEVELS OF DIFFICULTY

Difficulty                              Index
(per cent)      1       2       3       4       5       6       7       8
  75-95      -.360   -.793   -.759   -.718   -.222   -.356   -.157   -.399
  65-85      -.122   -.069    .051   -.001   -.129   -.125   -.103    .018
  55-75      -.030    .250    .362    .286   -.083    .127   -.140    .101
  45-65      -.104    .380    .440    .436   -.158    .074   -.278    .084
  40-60      -.134    .374    .424    .436   -.207   -.005   -.307    .039
  35-55      -.049    .455    .468    .488   -.165    .040   -.210    .138
  25-45       .238    .488    .426    .419    .132    .188    .116    .276
  15-35       .213    .187    .113    .126    .228    .079    .169    .182
   5-25      -.079   -.708   -.767   -.720    .095   -.211    .145   -.356


























Indices 1, 7, and 8, on the other hand, show relatively little variation 
from one difficulty-level to another. The means of the coefficients 
for the various difficulty-levels are also given. They are probably 
more comparable than the correlations for the complete set, the group- 
ing of the items having eliminated to some extent the effect of Balance. 


CONCLUSIONS 


The eight methods of measuring test-item validity have divided 
themselves into two distinct groups. Indices 2, 3, and 4, which are 
TABLE VI. RELIABILITY COEFFICIENTS FOR EACH INDEX COMPUTED AT
SUCCESSIVE LEVELS OF DIFFICULTY

Difficulty                                  Index
(per cent)      1        2        3        4        5        6        7        8
  75-95       .6430    .6317    .6855+   .6230    .6682    .5567    .6751    .7107
  65-85       .6727    .6972    .5920    .4515+   .5848    .4711    .6737    .6911
  55-75       .5893    .5653    .4907    .4709    .5274    .0806    .6536    .5730
  45-65       .6514    .6662    .6172    .6712    .6583    .6926    .6746    .6850-
  40-60       .6758    .6611    .6305+   .6200    .6192    .6911    .6644    .6737
  35-55       .6571    .6577    .6212    .6306    .6379    .7188    .6586    .7048
  25-45       .5684    .4736    .4290    .3286    .4431    .5024    .5577    .5214
  15-35       .6992    .6550+   .5736    .3962    .4886    .5688    .6843    .6449
   5-25       .5343    .7480    .6697    .6304    .4596    .4109    .4589    .6236
  Total       .5876    .7359    .7088    .6705-   .5395+   .4927    .5698    .6662
  Mean        .6324    .6395+   .5899    .5358    .5652    .5214    .6334    .6476





























highly correlated with Balance, are best suited for the selection of 
valid items for a test to be administered to a homogeneous group. 
The remaining indices may be used to select items for a test to be 
administered to a heterogeneous group. This distinction is made
in accordance with the following propositions from an article by
Symonds:¹


The best test for measuring a typical school grade or class is a test in which 
all of the items have a difficulty such that they can be answered with fifty 
per cent accuracy by the average individual in the group. 

The best test designed to measure several consecutive grades or classes is 
one in which the items have been so selected that they range evenly in dif- 
ficulty from the level of difficulty which can be done with fifty per cent accu- 
racy by the average member of the lowest group to be tested to the level of 
difficulty which can be done with fifty per cent accuracy by the average 
member of the highest group to be tested. 


Of the three methods in the first group, Index 3 (Holzinger) is 
recommended. Index 2 is discarded on the basis of its complexity. 
Indices 3 and 4 are equally easy to calculate, but preference is given 
to the former because its reliability coefficients exceed those of the 
latter in seven of the nine difficulty-groups. 
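The comparison of Indices 3 and 4 can be checked directly from the entries of Table VI. The Python sketch below copies the nine difficulty-group coefficients for the two indices as read from that table, in its row order from 75-95 down to 5-25; the variable names are hypothetical.

```python
# Reliability coefficients from Table VI, difficulty groups 75-95 ... 5-25.
index_3 = [.6855, .5920, .4907, .6172, .6305, .6212, .4290, .5736, .6697]
index_4 = [.6230, .4515, .4709, .6712, .6200, .6306, .3286, .3962, .6304]

# Number of difficulty groups in which Index 3 is the more reliable.
groups_favoring_3 = sum(a > b for a, b in zip(index_3, index_4))
mean_3 = sum(index_3) / len(index_3)
mean_4 = sum(index_4) / len(index_4)
```

Index 3 exceeds Index 4 in seven of the nine groups, and the two means, about .5899 and .5358, agree with the "Mean" row of the table.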

In the case of the second group of methods, Index 7 (Difference 
between Means) is recommended. It has already been pointed out 





1 Symonds, Percival M.: ‘‘Choice of Items for a Test on Basis of Difficulty,” 
Journal of Educational Psychology, Vol. XX (October, 1929), pp. 481-493. 
that Index 1 is considered the best measure, but its computation is
extremely time-consuming. The correlations of Tables I and II 
and the reliability coefficients of Table VI show Index 7 to be in close 
agreement with Index 1. There is little choice between Index 5 and 
Index 7 on the basis of ease of calculation. Index 5, however, yields 
consistently lower reliability coefficients with the exception of that 
for the lowest difficulty-group, and is therefore eliminated. Index 6, 
the erratic measure with regard to reliability, must on this account 
be eliminated. That the low coefficient (.08) of the (55-75) per cent 
difficulty-group is due rather to unreliability of the method than to
unreliability of grouping is indicated by the consistently higher values 
for the other methods in the same group. Furthermore, as the number 
of pupils increases, the calculation of Index 6 becomes more laborious. 
Index 8 is not so independent of Balance as the remaining methods 
in this group, yet it does not belong in the first group. The method 
is not recommended principally because the labor of computation is 
almost prohibitive. 

In the foregoing discussion two methods, Holzinger and Difference 
between Means, have been recommended as the best and the most 
practical. The former is so much easier to calculate that it was 
thought desirable to set up the regression equation for predicting 
ratings by Index 7, or, better still, the ratings by Index 1, when the 
ratings by Index 3 and the Balance are known. The regression equa- 
tion is written as follows: 


$$X_1 = .01357X_3 - .007453X_B + .3076 \pm .05030 \qquad (1)$$

The equation is probably more convenient to use if Balance is expressed
in terms of the smaller of $R$ and $W$, which may be denoted by $B'$.
To write (1) in terms of $B'$ it is necessary only to divide the coefficient
of $X_B$ by $N/100$, or 1.42. This gives the working equation,

$$X_1 = .01357X_3 - .005249X_{B'} + .3076 \pm .05030 \qquad (2)$$

The multiple correlation coefficient, $R_{1(3B)}$, is .8909. If a calculating
machine is available, it is quicker to validate the test items by
Holzinger's method and apply formula (2) than it is to compute two
means for each item.
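A short Python sketch of the working equation follows. It assumes that the Index 3 rating entered is the simplified form $R_u - R_l$ and that $B'$ is the smaller of the numbers of "rights" and "wrongs" among the one hundred forty-two pupils. The function names are hypothetical, and the standard two-predictor formula for the multiple correlation is assumed rather than quoted from the text.

```python
from math import sqrt

def predict_index_1(v3, b_prime):
    """Point estimate of the Index 1 rating from equation (2)."""
    return 0.01357 * v3 - 0.005249 * b_prime + 0.3076

def multiple_r(r_1x, r_1b, r_xb):
    """Multiple correlation of variable 1 on two predictors, x and B."""
    r_sq = (r_1x**2 + r_1b**2 - 2 * r_1x * r_1b * r_xb) / (1 - r_xb**2)
    return sqrt(r_sq)

coefficient_b_prime = 0.007453 / 1.42              # coefficient of (1) over N/100
r_1_3b = multiple_r(0.7485, 0.0055, 0.5475)        # raw values from Table I
```

The division gives approximately .005249, the coefficient of equation (2), and the raw coefficients of Table I give a multiple correlation of about .8909, as stated. For an item with a simplified Index 3 rating of 20 and $B' = 50$, the predicted Index 1 rating is about .317.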

This study has been limited to a comparison of methods of evaluat- 
ing test items which are scored only “right” or “wrong.” The 
effect upon the test validity and reliability of employing valid items 
selected by any of these methods has not been considered. This 
question is covered at some length by Long and Sandiford.¹





1 Long, John A. and Sandiford, Peter: op. cit. 













PUBLICATIONS RECEIVED 


T. R. McConnell, Lyle K. Henry, and Clellen Morgan. Studies in
the Psychology of Learning, II. Iowa City: University of Iowa, 1934, pp. 143.
(paper.)

Adolph E. Meyer. Visual Outline of the History of Education. New
York: Longmans, Green & Co., 1935, pp. 96. (paper.)

N. P. Neilson and Frederick W. Cozens. Achievement Scales in
Physical Education Activities. New York: A. S. Barnes and Co., 1934, pp. 171.

Raymond Carver Perry. A Group Factor Analysis of the Adjustment
Questionnaire. Los Angeles: University of Southern California Press, 1934,
pp. 93.

Charles C. Peters, Editor. Abstracts of Studies in Education at
Pennsylvania State College, Part V (1935). State College, Pa.: Pennsylvania State
College, 1935, pp. 55. (paper.)

Charles C. Peters and Walter R. Van Voorhis. Statistical Procedures
and Their Mathematical Basis. State College, Pa.: Pennsylvania
State College, 1935, pp. 363.

C. Radulescu-Motru, Director. Analele de Psihologie. Volume I.
Bucuresti: Societatea Romana de Cercetari Psihologice, 1934, pp. 222.
(paper.)

Winifred V. Richmond. An Introduction to Sex Education. New York:
Farrar and Rinehart, Inc., 1934, pp. 312.

Robert T. Rock, Jr. The Influence upon Learning of the Quantitative
Variation of After-effects. New York: Bureau of Publications, Teachers
College, Columbia University, 1935, pp. 78.

David Segel. Prediction of Success in College. Bulletin 1934, No. 15,
United States Office of Education. Washington: Government Printing
Office, 1934, pp. 98. (paper.)

Margarete Simpson. Parent Preferences of Young Children. New York:
Bureau of Publications, Teachers College, Columbia University, 1935, pp. 85.

Ernest Burton Skaggs. A Textbook of Experimental and Theoretical
Psychology. Boston: The Christopher Publishing House, 1935, pp. 426.

C. Ebblewhite Smith. The Construction and Validation of a Group Test
of Intelligence Using the Spearman Technique. Toronto: Ontario College of
Education, 1935, pp. 56. (paper.)

Enid Severy Smith. A Study of Twenty-five Adolescent Unmarried
Mothers in New York City. New York: Salvation Army Women's Home and
Hospital, 1935, pp. 97. (paper.)

Hazel Martha Stanton. Measurement of Musical Talent. Iowa City:
State University of Iowa, 1935, pp. 140. (paper.)

James Bart Stroud. Educational Psychology. New York: The
Macmillan Co., 1935, pp. 490.




J. W. Studebaker. The American Way. New York: McGraw-Hill Book
Co., 1935, pp. 206.

Randall Thompson. College Music: an Investigation for the Association
of American Colleges. New York: The Macmillan Co., 1935, pp. 279.

E. L. Thorndike. The Thorndike-Century Junior Dictionary. Chicago:
Scott, Foresman & Co., 1935, pp. 970.

Edward L. Thorndike and Others. Adult Interests. New York: The
Macmillan Co., 1935, pp. 265.

Edith J. Varon. The Development of Alfred Binet's Psychology.
Psychological Monographs No. 207. Princeton, N. J.: Psychological Review Co.,
1935, pp. 129. (paper.)

Lovisa C. Wagoner. Observation of Young Children. New York:
McGraw-Hill Book Co., Inc., 1935, pp. 297. (paper.)

Helen M. Walker. The Measurement of Teaching Efficiency. New
York: The Macmillan Co., 1935, pp. 237.

James J. Walsh. Education of the Founding Fathers of the Republic.
Scholasticism in the Colonial Colleges. New York: Fordham University
Press, 1935, pp. 377.

Alice E. Watson. Experimental Studies in the Psychology and Pedagogy of
Spelling. New York: Bureau of Publications, Teachers College, Columbia
University, 1935, pp. 144.

Michael West. Definition Vocabulary. Toronto: Department of
Educational Research, University of Toronto, 1935, pp. 105. (paper.)

Henry Nelson Wieman and Regina Westcott-Wieman. Normative
Psychology of Religion. New York: Thomas Y. Crowell Co., 1935, pp. 563.

Dorothy Hazeltine Yates. Psychological Racketeers. Boston: Bruce
Humphries, Inc., 1935, pp. 232.














